
Expand section alignment to more languages, and share dumps
Closed, Resolved · Public

Description

The goal of this task is to expand the existing section alignment prototype, creating alignments in more languages, and to share the resulting DB with the Language team.

  • Hire and onboard a contractor.
  • Move the existing code to PySpark.
  • Explore new cross-lingual embedding systems that help scale to more languages more efficiently.
  • Share the DB with the Language team.

Update: The Language team prefers to receive the alignments as dumps, to incorporate them into their pipelines, instead of as a separate API.

Event Timeline

Updates

  • @MunizaA has joined as a contractor to work on this project.
  • We have started the onboarding process.

Updates

  • @MunizaA's onboarding is going really fast. She is exploring our (Py)Spark infrastructure.

Updates

  • @MunizaA is working on porting the existing code to PySpark.

Updates

  • @MunizaA has already moved the extraction pipeline to PySpark.
  • Now we will start working on the data processing.

Updates

  • @MunizaA is testing new language models that could be more efficient, and possibly more accurate, than the FastText embeddings used in the previous experiments.

Updates

  • We obtained the first results with the new language models. @MunizaA, could you please report the numbers here?

We experimented with multiple pre-trained models from sentence-transformers to find a multilingual model that can accurately and efficiently encode section headings. We found that paraphrase-xlm-r-multilingual-v1 provides the most accurate and consistent results across multiple languages for our use case. It maps sentences to a 768-dimensional shared vector space, and the resulting vectors can then be used to calculate the cosine similarity between co-occurring sections.
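As an illustration of how the model is used, here is a minimal sketch with sentence-transformers (the section headings are made up for the example):

```python
from sentence_transformers import SentenceTransformer, util

# Load the multilingual model mentioned above.
model = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")

# Illustrative section headings from co-occurring en/fr articles.
en_sections = ["Early life", "Career", "References"]
fr_sections = ["Biographie", "Carrière", "Références", "Notes"]

# Encode into the shared 768-dimensional space.
en_emb = model.encode(en_sections, convert_to_tensor=True)
fr_emb = model.encode(fr_sections, convert_to_tensor=True)

# Cosine similarity between every (en, fr) pair, then rank the candidate
# targets per source section, most similar first.
sim = util.cos_sim(en_emb, fr_emb)
for i, heading in enumerate(en_sections):
    ranked = sorted(zip(fr_sections, sim[i].tolist()),
                    key=lambda pair: pair[1], reverse=True)
    print(heading, "->", ranked)
```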
The following results were obtained by running the same model-evaluation experiment for different language pairs. We evaluate models by first aligning articles in two languages using their Wikidata IDs. We then take the sections from those aligned articles and generate all possible combinations. The selected model is then used to encode these section pairs and calculate their similarity. We then rank the pairs by similarity for each section and check the rank of the true section translation (the one that is in our dataset). Note that these results only contain language pairs for which we had more than 20 records in our dataset.

| source language | target language | sampled articles · aligned sections · pairs in dataset · pairs found | precision@1 | precision@3 | precision@5 |
| --- | --- | --- | --- | --- | --- |
| enwiki | frwiki | 30964516742701055842 | 0.4548 | 0.6377 | 0.6959 |
| enwiki | ruwiki | 28326618303271009842 | 0.4909 | 0.7090 | 0.7779 |
| enwiki | jawiki | 1760621627190488306 | 0.3039 | 0.4477 | 0.5196 |
| ruwiki | enwiki | 2831511842747459386 | 0.4768 | 0.6891 | 0.7409 |
| arwiki | frwiki | 134943651561211145 | 0.4689 | 0.6348 | 0.6965 |
| eswiki | enwiki | 3393131886851255232 | 0.5577 | 0.7370 | 0.7844 |
| arwiki | enwiki | 343110855585485405 | 0.5753 | 0.7901 | 0.8518 |
| eswiki | frwiki | 2351381142779145136 | 0.4779 | 0.7132 | 0.75 |
| ruwiki | frwiki | 1994951107924291222 | 0.3828 | 0.6441 | 0.7387 |

The precision@n denotes the probability that, of all the aligned target headings for a section, the dataset translation was among the top n.

It is great to see this progress. Thanks for all this work, @MunizaA and @diego!

Updates

  • @MunizaA has developed the full pipeline to efficiently extract all the features used in the original model, such as link similarity and edit distance.
  • We are currently preparing the experiment to validate our results using the new language model (to replace FastText).

Updates

  • @MunizaA has run the first experiments comparing the new language model with our old FastText-based model, obtaining promising results. (@MunizaA, please share the new results here.)
  • The next steps are:
    • Test the model for language pairs without training data.
    • Estimate the time required to run the model in the 100+ languages supported by this new approach.

In order to assess the accuracy of our current language model, we tried to replicate the experiment that @diego had run with the FastText embeddings. This involved training a classifier on a portion of the ground truth and then using it to predict the similarity of the remaining section pairs in the ground truth. More specifically, we took our previously generated set of all possible section pairs for the 6 languages used in this experiment and, for each pair, extracted a set of features that describe it (the number of times the two sections occur together, how similar the links they contain are on average, etc.), which happen to be a subset of the features that Diego used. We then labelled all pairs that are found in the ground truth as 'True' and the rest as 'False'. A gradient-boosting classifier was trained on a portion of this data and then used to classify the rest. We then dense-ranked the results from this classifier to evaluate the probability of a pair from the ground truth ending up in the top 5 (precision@5).
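A minimal sketch of this evaluation, assuming a DataFrame of candidate pairs; the input file and the feature and label column names below are made up for the example, not the actual feature set:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# One row per candidate (source section, target section) pair.
pairs = pd.read_parquet("section_pairs.parquet")  # hypothetical input
feature_cols = ["cooccurrence_count", "avg_link_similarity", "edit_distance"]

train, test = train_test_split(pairs, test_size=0.3, random_state=0)

# Train a gradient-boosting classifier on part of the labelled pairs.
clf = GradientBoostingClassifier()
clf.fit(train[feature_cols], train["in_ground_truth"])

# Score the held-out pairs and dense-rank the candidates per source section.
test = test.copy()
test["score"] = clf.predict_proba(test[feature_cols])[:, 1]
test["rank"] = (test.groupby("source_section")["score"]
                    .rank(method="dense", ascending=False))

# precision@5: fraction of ground-truth pairs ranked in the top 5.
ground_truth = test[test["in_ground_truth"]]
print((ground_truth["rank"] <= 5).mean())
```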
The results from this experiment came out comparable to the previously documented ones. This means that we can use a multilingual model in place of FastText, which is monolingual (similar words within a language share similar vectors, but translations across languages do not), eliminating the need to align vectors from two languages in a single vector space before they can be compared, and still expect similar results.
The following image depicts the results from the experiment mentioned above. Empty boxes in the chart represent cases where we didn't have enough ground truth.

precision.png (432×720 px, 15 KB)

Updates

  • We are analyzing the results shown above before deciding on the next steps.

Updates

  • We have done manual sanity checks on the data extraction pipeline, confirming that it is working properly.
  • The next step will be to run the model on 20 new languages.

@Pginer-WMF or @santhosh, could you please propose some language pairs that you will be able to evaluate manually? Please consider any language on this list. Feel free to select a difficult pair, which would give us a lower bound for understanding the model's precision on under-resourced languages.

Thanks again for this work and the updates. I reply to the specific question below:

@Pginer-WMF or @santhosh, could you please propose some language pairs that you will be able to evaluate manually? Please consider any language on this list. Feel free to select a difficult pair, which would give us a lower bound for understanding the model's precision on under-resourced languages.

I can help directly with any combination of English, Spanish and Catalan.

In the Language team we can easily find help to evaluate English-{Bengali, Catalan, Finnish, Greek, Gujarati, Hebrew, Indonesian, Malayalam, and Russian}. Also any combination of English, Hebrew, Russian and Catalan.

Are any combinations of the above relevant for testing this well? If not, I can check to get a more detailed listing, since I'm sure I haven't covered all the linguistic capabilities of the team in the above list.

Updates

  • @MunizaA has uploaded these sample files, covering several languages. Each of them contains the top 200 most frequent sections in the source language.
  • @Pginer-WMF, please have a look at them. Keep in mind that we are focusing on recall more than precision. For now, we are showing the top 20 most similar target sections per source section.
  • I'll coordinate a meeting in the following days to discuss how to tune these results.

Looking at some of the results, the recall-focused approach makes sense in general. It helps to identify sections that cover the same contents (even when they use synonyms or contain typos) or have a significant overlap ("References" vs. "References and notes").

In some cases the mappings connect what seem to be totally unrelated concepts. This tends to happen with the lowest scores (and not for all sections), so it may be worth setting a threshold to exclude these. For example (from ca-en), the "Naixements" (Births) section is mapped to "In fiction", "November" and "See also". This is problematic because an article that is missing the "Births" section may not get it suggested for translation, because it has a "See also" section and our system thinks they are equivalent.

Thinking about the right balance between the two problems, I think the most concerning one for our users is when a section is proposed for translation (i.e., identified as missing) but its contents are already there (i.e., present in some form we were not able to map). But I'm not sure how they would perceive the opposite problem (i.e., a section shown as present when it is actually missing) if it starts to become more frequent.

We can discuss further how to adjust the balance, but I think that the kind of mappings shown could help address some of the issues we have identified in the past (T283817).

Updates

  • Together with @MunizaA, we have annotated data for Spanish to English and Urdu to English.
    • We found that the popularity of sections (the number of articles they appear in) has a huge impact on the quality of the results.
    • While popular sections have multiple possible translations, the most infrequent ones usually have only one or two.
    • We are trying to improve the model to address these issues.
  • We are also analyzing how to use MT to improve the results.

@Pginer-WMF, you mentioned that for using the MT services that require a key, we should do it from our servers. Does this mean that we need to ssh into some machine and work from there, or is there an endpoint that we could use for this? If that is the case, could you please provide an example?

@Pginer-WMF, you mentioned that for using the MT services that require a key, we should do it from our servers. Does this mean that we need to ssh into some machine and work from there, or is there an endpoint that we could use for this? If that is the case, could you please provide an example?

We have an open API, https://cxserver.wikimedia.org/v2/translate/FROM/TO/PROVIDER (https://codepen.io/santhoshtr/pen/zjMMrG has a working example). But it can only be used with free and open-source MT engines, like the Apertium instance that WMF hosts. MT engines like Google are metered services, and we restrict their use to internal cxserver calls only. It is possible to use them by creating a new set of secret keys for this purpose, with some predefined quota, but that needs to be discussed privately (not on Phabricator).
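For illustration, a minimal request against the open endpoint could look like the sketch below; the `html` payload field follows the codepen example linked above, so treat the exact request shape as an assumption:

```python
import requests

# Hypothetical example: translate a snippet from English to Spanish using
# the free Apertium backend of the open cxserver API.
url = "https://cxserver.wikimedia.org/v2/translate/en/es/Apertium"
response = requests.post(url, data={"html": "<p>Early life</p>"})
print(response.json())
```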

Updates

  • We are fine-tuning the model.
diego renamed this task from "Expand section alignment to more languages, and create an API" to "Expand section alignment to more languages, and share dumps". Mar 7 2022, 11:32 AM
diego updated the task description.

Updates

  • We are working on applying the model at scale. @MunizaA has been experimenting with native Spark libraries to see if it is possible to replace external dependencies. The quality of the first results is not satisfactory, so we are exploring alternatives.

Updates

  • We decided to go back to the XGBoost-based model, because its results were better than those of the Spark implementation.
  • We noticed a decrease in precision for under-resourced languages. Our hypothesis is that the quality of the embeddings created by mBERT is not very high for them. We decided to create a second, language-agnostic model and then compare the results. Our intuition is that for some languages the language-agnostic model will be better.
  • We plan to release all these results at the end of next week.

Updates

  • We have tested our model on the CX dataset (section translations done using the CX tool).
  • The results show good performance. @MunizaA, please report the precision@5 for the top 100 language pairs.
  • We are now running the alignments for all languages, and the results will be ready early next week.

@Pginer-WMF / @santhosh: The results per language are around 1 GB each. For example, 'es' to all other languages is 0.9 GB. Putting all languages together in one file would be impossible, so SQLite does not seem to be a feasible solution. We could split the data into one file per language pair, but that would be over 2K files, for example 2K different CSV or SQLite files. Another option is to store these results in Hive, or directly in Parquet. What would you prefer?
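For reference, the Parquet option could look like this minimal PySpark sketch, with one partition per language pair instead of thousands of separate files; the column names and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with one row per aligned section pair.
alignments = spark.read.parquet("hdfs:///tmp/section_alignments_raw")

# Write one Parquet partition per (source, target) language pair.
(alignments
    .write
    .partitionBy("source_wiki", "target_wiki")
    .mode("overwrite")
    .parquet("hdfs:///tmp/section_alignments"))
```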

The following results include the top 100 language pairs by number of section pairs tested. The precision here denotes the probability that, of all the aligned target sections for a source section in our extracted data, the CX dataset translation was among the top 5. Please note that any source section occurring more than once per (source language, target language) pair in the CX dataset was counted as one pair, and tested by checking whether any of the corresponding targets ended up among the top 5.

| Source language | Target language | Precision @ 5 | Pairs tested |
| --- | --- | --- | --- |
| enwiki | eswiki | 0.970 | 12988 |
| enwiki | frwiki | 0.939 | 9165 |
| enwiki | arwiki | 0.937 | 8456 |
| enwiki | viwiki | 0.946 | 6054 |
| ruwiki | ukwiki | 0.986 | 5980 |
| ruwiki | bawiki | 0.919 | 5382 |
| enwiki | jawiki | 0.906 | 5328 |
| enwiki | zhwiki | 0.915 | 5153 |
| enwiki | itwiki | 0.941 | 5039 |
| enwiki | ukwiki | 0.944 | 4934 |
| enwiki | ptwiki | 0.964 | 4691 |
| enwiki | trwiki | 0.953 | 4246 |
| enwiki | ruwiki | 0.912 | 4110 |
| enwiki | hewiki | 0.925 | 4062 |
| enwiki | idwiki | 0.973 | 3495 |
| enwiki | fawiki | 0.946 | 3402 |
| enwiki | rowiki | 0.964 | 3048 |
| enwiki | bnwiki | 0.962 | 2832 |
| enwiki | tawiki | 0.963 | 2707 |
| enwiki | elwiki | 0.946 | 2685 |
| enwiki | cawiki | 0.940 | 2604 |
| eswiki | cawiki | 0.971 | 2296 |
| frwiki | ocwiki | 0.989 | 2094 |
| enwiki | dewiki | 0.876 | 1884 |
| enwiki | pawiki | 0.982 | 1781 |
| enwiki | mlwiki | 0.952 | 1632 |
| enwiki | cswiki | 0.917 | 1466 |
| enwiki | kowiki | 0.905 | 1375 |
| enwiki | mkwiki | 0.966 | 1308 |
| enwiki | srwiki | 0.928 | 1212 |
| enwiki | sqwiki | 0.971 | 1178 |
| enwiki | nlwiki | 0.925 | 1176 |
| enwiki | mswiki | 0.957 | 1174 |
| enwiki | afwiki | 0.977 | 1089 |
| enwiki | huwiki | 0.897 | 1041 |
| dewiki | frwiki | 0.852 | 1026 |
| frwiki | eswiki | 0.920 | 995 |
| ruwiki | hywiki | 0.959 | 991 |
| frwiki | enwiki | 0.918 | 922 |
| dewiki | enwiki | 0.895 | 893 |
| enwiki | urwiki | 0.948 | 828 |
| enwiki | plwiki | 0.891 | 824 |
| enwiki | tewiki | 0.953 | 813 |
| eswiki | enwiki | 0.913 | 797 |
| ukwiki | ruwiki | 0.958 | 754 |
| jawiki | zhwiki | 0.856 | 750 |
| enwiki | fiwiki | 0.888 | 732 |
| enwiki | thwiki | 0.920 | 679 |
| enwiki | hiwiki | 0.938 | 659 |
| enwiki | dawiki | 0.933 | 658 |
| frwiki | itwiki | 0.921 | 648 |
| eswiki | euwiki | 0.946 | 635 |
| enwiki | slwiki | 0.959 | 631 |
| dewiki | itwiki | 0.872 | 626 |
| enwiki | cywiki | 0.955 | 616 |
| ruwiki | hewiki | 0.874 | 595 |
| ruwiki | enwiki | 0.906 | 595 |
| enwiki | tlwiki | 0.939 | 594 |
| eswiki | glwiki | 0.927 | 587 |
| enwiki | orwiki | 0.926 | 582 |
| enwiki | svwiki | 0.930 | 568 |
| enwiki | kawiki | 0.952 | 568 |
| enwiki | bgwiki | 0.929 | 564 |
| ruwiki | bewiki | 0.978 | 544 |
| enwiki | hywiki | 0.918 | 538 |
| enwiki | mywiki | 0.929 | 535 |
| eswiki | frwiki | 0.882 | 534 |
| enwiki | guwiki | 0.958 | 524 |
| frwiki | cawiki | 0.922 | 523 |
| enwiki | knwiki | 0.965 | 510 |
| enwiki | glwiki | 0.901 | 506 |
| dewiki | nlwiki | 0.876 | 499 |
| ruwiki | ttwiki | 0.950 | 497 |
| cawiki | eswiki | 0.961 | 491 |
| enwiki | hawiki | 0.924 | 487 |
| eswiki | ptwiki | 0.960 | 475 |
| dewiki | eswiki | 0.870 | 453 |
| enwiki | ckbwiki | 0.642 | 450 |
| frwiki | arwiki | 0.824 | 449 |
| plwiki | ukwiki | 0.918 | 426 |
| itwiki | frwiki | 0.903 | 423 |
| zhwiki | enwiki | 0.899 | 414 |
| enwiki | siwiki | 0.951 | 412 |
| enwiki | euwiki | 0.926 | 404 |
| enwiki | hrwiki | 0.948 | 400 |
| itwiki | enwiki | 0.932 | 385 |
| ruwiki | tgwiki | 0.916 | 382 |
| enwiki | jvwiki | 0.866 | 372 |
| itwiki | eswiki | 0.923 | 364 |
| enwiki | eowiki | 0.893 | 355 |
| enwiki | etwiki | 0.915 | 354 |
| dewiki | ukwiki | 0.852 | 352 |
| jawiki | kowiki | 0.937 | 350 |
| ptwiki | enwiki | 0.935 | 336 |
| ruwiki | kkwiki | 0.955 | 332 |
| frwiki | ptwiki | 0.927 | 329 |
| enwiki | gawiki | 0.966 | 323 |
| enwiki | mrwiki | 0.944 | 322 |
| ruwiki | sahwiki | 0.729 | 321 |
| enwiki | bswiki | 0.974 | 312 |

@Pginer-WMF / @santhosh: The results per language are around 1 GB each. For example, 'es' to all other languages is 0.9 GB. Putting all languages together in one file would be impossible, so SQLite does not seem to be a feasible solution. We could split the data into one file per language pair, but that would be over 2K files, for example 2K different CSV or SQLite files. Another option is to store these results in Hive, or directly in Parquet. What would you prefer?

I'll let @santhosh comment on the technical solutions. From the product perspective, one consideration is that not all language pairs are used with the same frequency when translating, so we can select the most frequently used ones if we need to reduce size. For example, looking at the stats we see that translations to Arabic mainly use English as the source (93%), with French (4%) and German (1%) following, so supporting Dutch to Arabic (0.04%) may not be a priority.

Updates

  • We have published the alignments for 205 languages here.
  • Each folder contains the alignments from that language to all others. For example, 'enwiki' contains the alignments from English to all the other wikis.
  • The format is SQLite. @santhosh, could you confirm you are able to read the files?
  • We are working on the documentation for the algorithm and its output.

Updates

  • We have published the documentation for this project here.
  • All code and data are available and linked on the documentation page.

Thanks @diego and @MunizaA. I downloaded some samples and was able to open the databases. The columns were easy to interpret too. I think the database size can be reduced drastically by removing irrelevant records (records with very low probability). For example:

In mlwiki_aligned_sections_2022-02.sqlite there are 327276 records for mlwiki -> enwiki. If I apply a probability > 0.90 filter, only 4414 records remain. That is just 1.3% of the records. For practical purposes, the low-probability records can be ignored.
For example, here the mappings for ഇലക്ട്രോണിക് വാച്ചുകൾ (Electronic watches) and കേന്ദ്ര മന്ത്രി (Central government minister) have no real targets, but candidates with very low probability are given. For production, I don't think we need those records in the database; they just slow down query performance.

image.png (758×1 px, 186 KB)

So I was wondering if we can run some filter query across all these databases at https://analytics.wikimedia.org/published/datasets/one-off/section_alignment/ and create a database that is good enough for production. That database should also be merged with the current database we have for section alignment. We (the Language team) can do that processing; it just takes time to download and run the filtering over these large databases. Alternatively, you could apply similar filtering in your scripts to generate databases with fewer records. That could save a lot of space and download time for the published data, and may also make the data provided more meaningful. What would be the best approach in your opinion?
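As a rough sketch of the filtering I have in mind (the table name `alignments` and the `probability` column are assumptions about the schema; the threshold is just an example):

```python
import sqlite3

# Shrink one published dump in place by dropping low-probability candidates.
con = sqlite3.connect("mlwiki_aligned_sections_2022-02.sqlite")
before = con.execute("SELECT COUNT(*) FROM alignments").fetchone()[0]
con.execute("DELETE FROM alignments WHERE probability <= 0.5")
con.commit()
con.execute("VACUUM")  # rewrite the file so it actually shrinks on disk
after = con.execute("SELECT COUNT(*) FROM alignments").fetchone()[0]
print(before, "->", after, "records")
con.close()
```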

Another consideration is which target languages we need in the database. For example, mlwiki -> iswiki (Malayalam to Icelandic) exists in the database, with 122 items with a probability score above 0.9 out of 23216 total items. The above-mentioned filtering would remove all of the low scores. The chances that somebody translates from Malayalam to Icelandic are very low.

Hi @santhosh, @MunizaA has created a new dump containing only the pairs with probability > 0.9, which you can find here. We think that 0.9 might be too high; let us know if you want to try other values between 0.5 and 0.9.

Thanks. I see that the total database for all pairs is just 485 MB. That is a great improvement. I was using 0.9 only as an example in my previous comment (sorry if I was not clear enough). I think we can include results with slightly lower scores too, as you mentioned, maybe 0.5 to 0.9. I don't think it will cause a much larger database. The probability and rank also need to be retained in the database table.

For cxserver, we have to use this database along with the current database we have. The existing database relies on frequency as the confidence factor, as it comes from the CX corpus. I am considering keeping both databases: first query the corpus-based database, and then the database produced by you.
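Purely as an illustration of that lookup order (the database paths, table, and column names below are placeholders; cxserver's actual implementation will differ):

```python
import sqlite3

# Hypothetical two-step lookup: try the CX-corpus mappings first, then
# fall back to the new alignment database.
def find_target_sections(source_title,
                         db_paths=("cx_corpus.sqlite",
                                   "section_alignments.sqlite")):
    for path in db_paths:
        with sqlite3.connect(path) as con:
            rows = con.execute(
                "SELECT target FROM mappings "
                "WHERE source = ? ORDER BY score DESC LIMIT 5",
                (source_title,),
            ).fetchall()
        if rows:  # stop at the first database with a match
            return [target for (target,) in rows]
    return []
```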

Hi @santhosh, I've restored the probability and rank columns in the database and uploaded the new version here. The directory also contains databases with lower threshold scores (0.5-0.8). Please let me know if you have any questions, thanks.

Thanks. We have created a ticket, T306963: Integrate new section mapping database, to use this database and integrate it with cxserver. We may have further questions as we work on the integration.