To identify sections that are present in one language version of an article but missing in another, we use section titles as a proxy. Comparing two paragraphs of text in different languages to determine whether their content is the same is difficult. Matching titles is not trivial either, but some heuristics make it workable.
Currently in cxserver, we use three strategies:
- Identify section titles and their corresponding titles in another language using similarity scores from an embedding model. This work is tracked at https://phabricator.wikimedia.org/T293511. We use crosslingual embedding models for this approach. After extracting and collecting titles from database dumps, we calculate cosine similarity across these titles and create a database of title pair mappings. A threshold score is set to filter only sufficiently good candidates. The source code for this approach is available at https://gitlab.wikimedia.org/mnz/section-alignment/
- Calculate section title alignment based on past translations by CX users. We use CX Parallel corpus dumps to extract all user translations for section titles. These pairs are expected to be better than other options since they have undergone human review. The extracted title pairs are also added to the section title pair database. See parse-cx-corpus.py
- Calculate frequent section titles in English, then use machine translation to get the translated title in the target language. This is only possible if we have a usable MT engine between English and the target language. See alignwithmt.js.
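The first strategy above can be sketched as follows. This is a minimal illustration, not the actual cxserver code: the `align_titles` function and the toy three-dimensional vectors are assumptions standing in for the output of a real crosslingual embedding model, and the threshold value is arbitrary.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def align_titles(source_titles, target_titles, embeddings, threshold=0.5):
    """Pair each source title with its best-scoring target title,
    keeping only pairs whose score clears the threshold."""
    pairs = []
    for s in source_titles:
        scored = [(t, cosine_similarity(embeddings[s], embeddings[t]))
                  for t in target_titles]
        best, score = max(scored, key=lambda x: x[1])
        if score >= threshold:
            pairs.append((s, best, score))
    return pairs

# Toy vectors standing in for a real crosslingual model's output.
embeddings = {
    "History":    np.array([0.9, 0.1, 0.0]),
    "References": np.array([0.0, 0.2, 0.9]),
    "Geschichte": np.array([0.85, 0.15, 0.05]),
    "Weblinks":   np.array([0.1, 0.9, 0.1]),
}
print(align_titles(["History", "References"], ["Geschichte", "Weblinks"], embeddings))
```

With these toy vectors, "History"/"Geschichte" scores high and is kept, while "References" has no candidate above the threshold and is dropped, mirroring how the threshold filters out weak candidates in the title pair database.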
The results from all three approaches above are inserted into a database called the section title database.
There are two versions of this database:
- An SQLite database that is used in non-production settings. This is also the database where we insert new items whenever we run the above three steps.
- A production MySQL database where the content from the above SQLite database is exported. See https://phabricator.wikimedia.org/T306963 for our first export.
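The SQLite insertion step can be sketched like this. The schema here is hypothetical (the real section title database may use different table and column names); the upsert pattern shows one way to accumulate title pairs from all three strategies while counting repeated observations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in practice

# Hypothetical schema; the real section title database may differ.
conn.execute(
    """CREATE TABLE IF NOT EXISTS titles (
           source_language TEXT,
           target_language TEXT,
           source_title TEXT,
           target_title TEXT,
           frequency INTEGER DEFAULT 1,
           UNIQUE (source_language, target_language, source_title, target_title)
       )"""
)

def insert_pair(lang_pair, source_title, target_title):
    """Upsert a title pair produced by any of the three strategies;
    repeated sightings bump the frequency counter."""
    conn.execute(
        """INSERT INTO titles
               (source_language, target_language, source_title, target_title)
           VALUES (?, ?, ?, ?)
           ON CONFLICT (source_language, target_language, source_title, target_title)
           DO UPDATE SET frequency = frequency + 1""",
        (*lang_pair, source_title, target_title),
    )

insert_pair(("en", "de"), "History", "Geschichte")
insert_pair(("en", "de"), "History", "Geschichte")
print(conn.execute("SELECT frequency FROM titles").fetchone()[0])  # 2
```

The `ON CONFLICT ... DO UPDATE` upsert (SQLite 3.24+) keeps the database idempotent across repeated runs of the three extraction steps.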
Known Limitations
- Crosslingual embedding works very poorly for non-Latin, low-resource languages. Trying newer crosslingual embedding models might improve results slightly, but not substantially, since embedding quality for low-resource languages remains a fundamental challenge.
- The second approach, using the CX parallel corpus, has a chicken-and-egg problem: we want translators to translate section titles so we can derive alignments, but suggesting those sections to translators requires the alignments in the first place. This works well for section titles that occur frequently across articles, but fails when there is no exact match.
- The third approach, using MT, only works when an MT engine is available. It also has another problem: MT works better with full sentences, as NMT systems are poor at translating short phrases out of context. For example, "Introduction" might get translated as "Introduction to the Bible".
- All three approaches share a common problem: they search for alignments using exact string matching. If a section title contains any spelling variation or stylistic alternative, all of the above approaches fail to report an alignment.
Proposal: Language Agnostic Sentence Similarity
Section title alignment is a bitext mining problem: for a given phrase or sentence, we want to identify its corresponding pair from a set of candidates. We want to do this efficiently, ideally constraining the candidate set to a minimum.
The state-of-the-art model for bitext mining is the Sentence Transformer LaBSE model. It supports roughly 110 languages. LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other.
Recently, LaBSE, and sentence-transformer models in general, have become practical to run on CPUs, thanks to the addition of OpenVINO and ONNX backends in the Sentence Transformers library. https://embed.toolforge.org/ hosts the LaBSE model with an OpenVINO backend.
Using that API, we can get a sentence similarity matrix for sections present in the source article and target article. By using a sensible threshold, we can find sections that are present or missing between article pairs.
A demo of this system is available at https://people.wikimedia.org/~santhosh/section-suggestions/. We can always compare the results with the existing API at https://cxserver.wikimedia.org/v2/suggest/sections/Japan/en/ml
The following example shows how to get a similarity matrix for section title sets of article pairs:
curl 'https://embed.toolforge.org/api/similarity_matrix' \
--compressed \
-X POST \
--data-raw '{"texts1":["Etymology","History","Geography","Government and politics","Economy","Infrastructure","Demographics","Culture","See also","Notes","References","External links"],"texts2":["Geographie","Bevölkerung","Landesnamen","Geschichte","Politik","Wirtschaft","Infrastruktur","Kultur","Siehe auch","Literatur","Weblinks","Einzelnachweise"],"model_id":"sentence-transformers/LaBSE"}'
The result is a 12x12 matrix of similarity scores (values between 0 and 1). With a threshold score of 0.567, we can find 10 matching sections, while the current system detects only 8.
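Reading aligned sections off such a matrix can be sketched as below. The `matching_sections` helper and the 3x3 matrix of scores are illustrative assumptions (not real API output); the real matrix would come from the `similarity_matrix` endpoint shown above.

```python
import numpy as np

def matching_sections(texts1, texts2, matrix, threshold=0.567):
    """For each source section, take the best-scoring target section;
    keep the pair only if its score clears the threshold."""
    matrix = np.asarray(matrix)
    pairs = []
    for i, row in enumerate(matrix):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((texts1[i], texts2[j], float(row[j])))
    return pairs

# A 3x3 slice of illustrative scores (not real API output).
texts1 = ["History", "Economy", "See also"]
texts2 = ["Geschichte", "Wirtschaft", "Siehe auch"]
matrix = [
    [0.91, 0.32, 0.28],
    [0.30, 0.88, 0.25],
    [0.22, 0.27, 0.95],
]
for src, tgt, score in matching_sections(texts1, texts2, matrix):
    print(f"{src} -> {tgt} ({score:.2f})")
```

Source sections whose best score falls below the threshold are the "missing" sections we would suggest for translation.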
Based on this proof of concept, the next steps are:
- Evaluate the effectiveness of semantic similarity in comparison with the existing API, and report objective measures of the improvement
- Evaluate on low-resource languages and report the effectiveness there
- If the evaluation succeeds, replace the existing suggestion system with the new one (this will require additional engineering tasks, especially for scaling model inference)