
Expand section alignment to more languages, and share dumps
Closed, Resolved · Public

Description

The goal of this task is to expand the existing section alignment prototype, creating alignments in more languages, and to share the resulting DB with the Language team.

  • Hire and onboard a contractor.
  • Move the existing code to PySpark.
  • Explore new cross-lingual embedding systems that help scale to more languages more efficiently.
  • Share the DB with the Language team.

Update: The Language team prefers to receive the alignments as dumps, to incorporate them into their pipelines, instead of as a separate API.

Event Timeline

Updates

  • @MunizaA has joined as a contractor to work on this project.
  • We have started the onboarding process.

Updates

  • @MunizaA's onboarding is going really fast. She is exploring our (Py)Spark infrastructure.

Updates

  • @MunizaA is working on porting the existing code to PySpark.

Updates

  • @MunizaA has already moved the extraction pipeline to PySpark.
  • Now we will start working on the data processing.

Updates

  • @MunizaA is testing new language models that could be more efficient, and possibly more accurate, than the FastText embeddings used in the previous experiments.

Updates

  • We obtained the first results with the new language models. @MunizaA, could you please report the numbers here?

We experimented with multiple pre-trained models from sentence-transformers to find a multilingual model that can accurately and efficiently encode section headings. We found that paraphrase-xlm-r-multilingual-v1 provides the most accurate and consistent results across multiple languages for our use case. It maps sentences to a 768-dimensional shared vector space, and the resulting vectors can then be used to calculate the cosine similarity between co-occurring sections.
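As an illustration of how the model is used, here is a minimal sketch with sentence-transformers (the section headings are made up for the example):

```python
from sentence_transformers import SentenceTransformer, util

# Load the multilingual model mentioned above.
model = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")

# Illustrative section headings from co-occurring en/fr articles.
en_sections = ["Early life", "Career", "References"]
fr_sections = ["Biographie", "Carrière", "Références", "Notes"]

# Encode into the shared 768-dimensional space.
en_emb = model.encode(en_sections, convert_to_tensor=True)
fr_emb = model.encode(fr_sections, convert_to_tensor=True)

# Cosine similarity between every (en, fr) pair, then rank the candidate
# targets per source section, most similar first.
sim = util.cos_sim(en_emb, fr_emb)
for i, heading in enumerate(en_sections):
    ranked = sorted(zip(fr_sections, sim[i].tolist()),
                    key=lambda pair: pair[1], reverse=True)
    print(heading, "->", ranked)
```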
The following results were obtained by running the same model-evaluation experiment for different language pairs. We evaluate models by first aligning articles in two languages using their Wikidata IDs. We then take the sections from those aligned articles and generate all possible combinations. The selected model is then used to encode these section pairs and calculate their similarity. We then rank the pairs by similarity for each section and check the rank of the true section translation (the one that is in our dataset). Note that these results only contain language pairs for which we had more than 20 records in our dataset.

| source language | target language | sampled articles · aligned sections · pairs in dataset · pairs found | precision@1 | precision@3 | precision@5 |
| --- | --- | --- | --- | --- | --- |
| enwiki | frwiki | 30964516742701055842 | 0.4548 | 0.6377 | 0.6959 |
| enwiki | ruwiki | 28326618303271009842 | 0.4909 | 0.7090 | 0.7779 |
| enwiki | jawiki | 1760621627190488306 | 0.3039 | 0.4477 | 0.5196 |
| ruwiki | enwiki | 2831511842747459386 | 0.4768 | 0.6891 | 0.7409 |
| arwiki | frwiki | 134943651561211145 | 0.4689 | 0.6348 | 0.6965 |
| eswiki | enwiki | 3393131886851255232 | 0.5577 | 0.7370 | 0.7844 |
| arwiki | enwiki | 343110855585485405 | 0.5753 | 0.7901 | 0.8518 |
| eswiki | frwiki | 2351381142779145136 | 0.4779 | 0.7132 | 0.75 |
| ruwiki | frwiki | 1994951107924291222 | 0.3828 | 0.6441 | 0.7387 |

The precision@n denotes the probability that, of all the aligned target headings for a section, the dataset translation was among the top n.

It is great to see this progress. Thanks for all this work, @MunizaA and @diego!

Updates

  • @MunizaA has developed the full pipeline to efficiently extract all the features used in the original model, such as link similarity and edit distance.
  • We are currently preparing the experiment to validate our results using the new language model (to replace FastText).

Updates

  • @MunizaA has run the first experiments comparing the new language model with our old FastText-based model, obtaining promising results. (@MunizaA, please share the new results here.)
  • The next steps are:
    • Test the model for language pairs without training data.
    • Estimate the time required to run the model in the 100+ languages supported by this new approach.

In order to assess the accuracy of our current language model, we tried to replicate the experiment that @diego had run with the FastText embeddings. This involved training a classifier on a portion of the ground truth and then using it to predict the similarity of the remaining section pairs in the ground truth. More specifically, we took our previously generated set of all possible section pairs for the 6 languages used in this experiment and, for each pair, extracted a set of features that describe it (the number of times the two sections occur together, how similar the links they contain are on average, etc.), which happen to be a subset of the features that Diego used. We then labelled all pairs that are found in the ground truth as 'True' and the rest as 'False'. A gradient-boosting classifier was trained on a portion of this data and then used to classify the rest. We then dense-ranked the results from this classifier to evaluate the probability of a pair from the ground truth ending up in the top 5 (precision@5).
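A minimal sketch of this evaluation, assuming a DataFrame of candidate pairs; the input file and the feature and label column names below are made up for the example, not the actual feature set:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# One row per candidate (source section, target section) pair.
pairs = pd.read_parquet("section_pairs.parquet")  # hypothetical input
feature_cols = ["cooccurrence_count", "avg_link_similarity", "edit_distance"]

train, test = train_test_split(pairs, test_size=0.3, random_state=0)

# Train a gradient-boosting classifier on part of the labelled pairs.
clf = GradientBoostingClassifier()
clf.fit(train[feature_cols], train["in_ground_truth"])

# Score the held-out pairs and dense-rank the candidates per source section.
test = test.copy()
test["score"] = clf.predict_proba(test[feature_cols])[:, 1]
test["rank"] = (test.groupby("source_section")["score"]
                    .rank(method="dense", ascending=False))

# precision@5: fraction of ground-truth pairs ranked in the top 5.
ground_truth = test[test["in_ground_truth"]]
print((ground_truth["rank"] <= 5).mean())
```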
The results from this experiment came out comparable to the previously documented ones. This means that we can use a multilingual model in place of FastText, which is monolingual (similar words within a language share similar vectors, but translations across languages do not), eliminating the need to align vectors from two languages in a single vector space before they can be compared, and still expect similar results.
The following image depicts the results from the experiment mentioned above. Empty boxes in the chart represent cases where we didn't have enough ground truth.

precision.png (432×720 px, 15 KB)

Updates

  • We are analyzing the results shown above before deciding on the next steps.

Updates

  • We have done manual sanity checks on the data extraction pipeline, confirming that it is working properly.
  • The next step will be to run the model on 20 new languages.

@Pginer-WMF or @santhosh, could you please propose some language pairs that you will be able to evaluate manually? Please consider any language on this list. Feel free to select a difficult pair, which would give us a lower bound for understanding the model's precision on under-resourced languages.

Thanks again for this work and the updates. I reply to the specific question below:

@Pginer-WMF or @santhosh, could you please propose some language pairs that you will be able to evaluate manually? Please consider any language on this list. Feel free to select a difficult pair, which would give us a lower bound for understanding the model's precision on under-resourced languages.

I can help directly with any combination of English, Spanish and Catalan.

In the Language team we can easily find help to evaluate English-{Bengali, Catalan, Finnish, Greek, Gujarati, Hebrew, Indonesian, Malayalam, and Russian}. Also any combination of English, Hebrew, Russian and Catalan.

Are any combinations of the above relevant for testing this well? If not, I can check to get a more detailed listing, since I'm sure I haven't covered all the linguistic capabilities of the team in the above list.

Updates

  • @MunizaA has uploaded these sample files, covering several languages. Each of them contains the top 200 most frequent sections in the source language.
  • @Pginer-WMF, please have a look at them. Keep in mind that we are focusing on recall more than precision. For now, we are showing the top 20 most similar target sections per source section.
  • I'll coordinate a meeting in the following days to discuss how to tune these results.

Looking at some of the results, the recall-focused approach makes sense in general. It helps to identify sections that cover the same contents (even when they use synonyms or contain typos) or have a significant overlap ("References" vs. "References and notes").

In some cases the mappings connect what seem to be totally unrelated concepts. This tends to happen with the lowest scores (and not for all sections), so it may be worth setting a threshold to exclude these. For example (from ca-en), the "Naixements" (Births) section is mapped to "In fiction", "November" and "See also". This is problematic because an article that is missing the "Births" section may not get it suggested for translation, because it has a "See also" section and our system thinks they are equivalent.

Thinking about the right balance between the two problems, I think the most concerning one for our users is when a section is proposed for translation (i.e., identified as missing) but its contents are already there (i.e., present in some form we were not able to map). But I'm not sure how they would perceive the opposite problem (i.e., a section shown as present when it is actually missing) if it starts to become more frequent.

We can discuss further how to adjust the balance, but I think that the kind of mappings shown could help address some of the issues we have identified in the past (T283817).

Updates

  • Together with @MunizaA, we have annotated data for Spanish to English and Urdu to English.
    • We found that the popularity of sections (the number of articles they appear in) has a huge impact on the quality of the results.
    • While popular sections have multiple possible translations, the most infrequent ones usually have only one or two.
    • We are trying to improve the model to address these issues.
  • We are also analyzing how to use MT to improve the results.

@Pginer-WMF, you mentioned that for using the MT services that require a key, we should do it from our servers. Does this mean that we need to ssh into some machine and work from there, or is there an endpoint that we could use for this? If that is the case, could you please provide an example?

@Pginer-WMF, you mentioned that for using the MT services that require a key, we should do it from our servers. Does this mean that we need to ssh into some machine and work from there, or is there an endpoint that we could use for this? If that is the case, could you please provide an example?

We have an open API, https://cxserver.wikimedia.org/v2/translate/FROM/TO/PROVIDER (https://codepen.io/santhoshtr/pen/zjMMrG has a working example). But it can only be used with free and open-source MT engines, like the Apertium instance that WMF hosts. MT engines like Google are metered services, and we restrict their use to internal cxserver calls only. It is possible to use them by creating a new set of secret keys for this purpose, with some predefined quota, but that needs to be discussed privately (not on Phabricator).
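For illustration, a minimal request against the open endpoint could look like the sketch below; the `html` payload field follows the codepen example linked above, so treat the exact request shape as an assumption:

```python
import requests

# Hypothetical example: translate a snippet from English to Spanish using
# the free Apertium backend of the open cxserver API.
url = "https://cxserver.wikimedia.org/v2/translate/en/es/Apertium"
response = requests.post(url, data={"html": "<p>Early life</p>"})
print(response.json())
```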

Updates

  • We are fine-tuning the model.
diego renamed this task from "Expand section alignment to more languages, and create an API" to "Expand section alignment to more languages, and share dumps". Mar 7 2022, 11:32 AM
diego updated the task description.

Updates

  • We are working on applying the model at scale. @MunizaA has been experimenting with native Spark libraries to see if it is possible to replace external dependencies. The quality of the first results is not satisfactory, so we are exploring alternatives.

Updates

  • We decided to go back to the XGBoost-based model, because its results were better than those of the Spark implementation.
  • We noticed a decrease in precision for under-resourced languages. Our hypothesis is that the quality of the embeddings created by mBERT is not very high for them. We decided to create a second, language-agnostic model and then compare the results. Our intuition is that for some languages the language-agnostic model will be better.
  • We plan to release all these results at the end of next week.

Updates

  • We have tested our model on the CX dataset (section translations done using the CX tool).
  • The results show good performance. @MunizaA, please report the precision@5 for the top 100 language pairs.
  • We are now running the alignments for all languages, and the results will be ready early next week.

@Pginer-WMF / @santhosh: The results per language are around 1 GB each. For example, 'es' to all other languages is 0.9 GB. Putting all languages together in one file would be impossible, so SQLite does not seem to be a feasible solution. We could split the data into one file per language pair, but that would be over 2K files, for example 2K different CSV or SQLite files. Another option is to store these results in Hive, or directly in Parquet. What would you prefer?
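For reference, the Parquet option could look like this minimal PySpark sketch, with one partition per language pair instead of thousands of separate files; the column names and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with one row per aligned section pair.
alignments = spark.read.parquet("hdfs:///tmp/section_alignments_raw")

# Write one Parquet partition per (source, target) language pair.
(alignments
    .write
    .partitionBy("source_wiki", "target_wiki")
    .mode("overwrite")
    .parquet("hdfs:///tmp/section_alignments"))
```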

The following results include the top 100 language pairs by number of section pairs tested. The precision here denotes the probability that, of all the aligned target sections for a source section in our extracted data, the CX dataset translation was among the top 5. Please note that any source section occurring more than once per (source language, target language) pair in the CX dataset was counted as one pair, and tested by checking whether any of the corresponding targets ended up among the top 5.

| Source language | Target language | Precision @ 5 | Pairs tested |
| --- | --- | --- | --- |
| enwiki | eswiki | 0.970 | 12988 |
| enwiki | frwiki | 0.939 | 9165 |
| enwiki | arwiki | 0.937 | 8456 |
| enwiki | viwiki | 0.946 | 6054 |
| ruwiki | ukwiki | 0.986 | 5980 |
| ruwiki | bawiki | 0.919 | 5382 |
| enwiki | jawiki | 0.906 | 5328 |
| enwiki | zhwiki | 0.915 | 5153 |
| enwiki | itwiki | 0.941 | 5039 |
| enwiki | ukwiki | 0.944 | 4934 |
| enwiki | ptwiki | 0.964 | 4691 |
| enwiki | trwiki | 0.953 | 4246 |
| enwiki | ruwiki | 0.912 | 4110 |
| enwiki | hewiki | 0.925 | 4062 |
| enwiki | idwiki | 0.973 | 3495 |
| enwiki | fawiki | 0.946 | 3402 |
| enwiki | rowiki | 0.964 | 3048 |
| enwiki | bnwiki | 0.962 | 2832 |
| enwiki | tawiki | 0.963 | 2707 |
| enwiki | elwiki | 0.946 | 2685 |
| enwiki | cawiki | 0.940 | 2604 |
| eswiki | cawiki | 0.971 | 2296 |
| frwiki | ocwiki | 0.989 | 2094 |
| enwiki | dewiki | 0.876 | 1884 |
| enwiki | pawiki | 0.982 | 1781 |
| enwiki | mlwiki | 0.952 | 1632 |
| enwiki | cswiki | 0.917 | 1466 |
| enwiki | kowiki | 0.905 | 1375 |
| enwiki | mkwiki | 0.966 | 1308 |
| enwiki | srwiki | 0.928 | 1212 |
| enwiki | sqwiki | 0.971 | 1178 |
| enwiki | nlwiki | 0.925 | 1176 |
| enwiki | mswiki | 0.957 | 1174 |
| enwiki | afwiki | 0.977 | 1089 |
| enwiki | huwiki | 0.897 | 1041 |
| dewiki | frwiki | 0.852 | 1026 |
| frwiki | eswiki | 0.920 | 995 |
| ruwiki | hywiki | 0.959 | 991 |
| frwiki | enwiki | 0.918 | 922 |
| dewiki | enwiki | 0.895 | 893 |
| enwiki | urwiki | 0.948 | 828 |
| enwiki | plwiki | 0.891 | 824 |
| enwiki | tewiki | 0.953 | 813 |
| eswiki | enwiki | 0.913 | 797 |
| ukwiki | ruwiki | 0.958 | 754 |
| jawiki | zhwiki | 0.856 | 750 |
| enwiki | fiwiki | 0.888 | 732 |
| enwiki | thwiki | 0.920 | 679 |
| enwiki | hiwiki | 0.938 | 659 |
| enwiki | dawiki | 0.933 | 658 |
| frwiki | itwiki | 0.921 | 648 |
| eswiki | euwiki | 0.946 | 635 |
| enwiki | slwiki | 0.959 | 631 |
| dewiki | itwiki | 0.872 | 626 |
| enwiki | cywiki | 0.955 | 616 |
| ruwiki | hewiki | 0.874 | 595 |
| ruwiki | enwiki | 0.906 | 595 |
| enwiki | tlwiki | 0.939 | 594 |
| eswiki | glwiki | 0.927 | 587 |
| enwiki | orwiki | 0.926 | 582 |
| enwiki | svwiki | 0.930 | 568 |
| enwiki | kawiki | 0.952 | 568 |
| enwiki | bgwiki | 0.929 | 564 |
| ruwiki | bewiki | 0.978 | 544 |
| enwiki | hywiki | 0.918 | 538 |
| enwiki | mywiki | 0.929 | 535 |
| eswiki | frwiki | 0.882 | 534 |
| enwiki | guwiki | 0.958 | 524 |
| frwiki | cawiki | 0.922 | 523 |
| enwiki | knwiki | 0.965 | 510 |
| enwiki | glwiki | 0.901 | 506 |
| dewiki | nlwiki | 0.876 | 499 |
| ruwiki | ttwiki | 0.950 | 497 |
| cawiki | eswiki | 0.961 | 491 |
| enwiki | hawiki | 0.924 | 487 |
| eswiki | ptwiki | 0.960 | 475 |
| dewiki | eswiki | 0.870 | 453 |
| enwiki | ckbwiki | 0.642 | 450 |
| frwiki | arwiki | 0.824 | 449 |
| plwiki | ukwiki | 0.918 | 426 |
| itwiki | frwiki | 0.903 | 423 |
| zhwiki | enwiki | 0.899 | 414 |
| enwiki | siwiki | 0.951 | 412 |
| enwiki | euwiki | 0.926 | 404 |
| enwiki | hrwiki | 0.948 | 400 |
| itwiki | enwiki | 0.932 | 385 |
| ruwiki | tgwiki | 0.916 | 382 |
| enwiki | jvwiki | 0.866 | 372 |
| itwiki | eswiki | 0.923 | 364 |
| enwiki | eowiki | 0.893 | 355 |
| enwiki | etwiki | 0.915 | 354 |
| dewiki | ukwiki | 0.852 | 352 |
| jawiki | kowiki | 0.937 | 350 |
| ptwiki | enwiki | 0.935 | 336 |
| ruwiki | kkwiki | 0.955 | 332 |
| frwiki | ptwiki | 0.927 | 329 |
| enwiki | gawiki | 0.966 | 323 |
| enwiki | mrwiki | 0.944 | 322 |
| ruwiki | sahwiki | 0.729 | 321 |
| enwiki | bswiki | 0.974 | 312 |

@Pginer-WMF / @santhosh: The results per language are around 1 GB each. For example, 'es' to all other languages is 0.9 GB. Putting all languages together in one file would be impossible, so SQLite does not seem to be a feasible solution. We could split the data into one file per language pair, but that would be over 2K files, for example 2K different CSV or SQLite files. Another option is to store these results in Hive, or directly in Parquet. What would you prefer?

I'll let @santhosh comment on the technical solutions. From the product perspective, one consideration is that not all language pairs are used with the same frequency when translating, so we can select the most frequently used ones if we need to reduce size. For example, looking at the stats we see that translations to Arabic mainly use English as the source (93%), with French (4%) and German (1%) following, so supporting Dutch to Arabic (0.04%) may not be a priority.

Updates

  • We have published the alignments for 205 languages here.
  • Each folder contains the alignments from that language to all others. For example, 'enwiki' contains the alignments from English to all the other wikis.
  • The format is SQLite. @santhosh, could you confirm you are able to read the files?
  • We are working on the documentation for the algorithm and its output.

Updates

  • We have published the documentation for this project here.
  • All code and data are available and linked on the documentation page.

Thanks @diego and @MunizaA. I downloaded some samples and was able to open the databases. The columns were easy to interpret too. I think the database size can be reduced drastically by removing irrelevant records (records with very low probability). For example:

In mlwiki_aligned_sections_2022-02.sqlite there are 327276 records for mlwiki -> enwiki. If I apply a probability > 0.90 filter, only 4414 records remain. That is just 1.3% of the records. For practical purposes, the low-probability records can be ignored.
For example, here the mappings for ഇലക്ട്രോണിക് വാച്ചുകൾ (Electronic watches) and കേന്ദ്ര മന്ത്രി (Central government minister) have no real targets, but candidates with very low probability are given. For production, I don't think we need those records in the database; they just slow down query performance.

image.png (758×1 px, 186 KB)

So I was wondering if we can run some filter query across all these databases at https://analytics.wikimedia.org/published/datasets/one-off/section_alignment/ and create a database that is good enough for production. That database should also be merged with the current database we have for section alignment. We (the Language team) can do that processing; it just takes time to download and run the filtering over these large databases. Alternatively, you could apply similar filtering in your scripts to generate databases with fewer records. That could save a lot of space and download time for the published data, and may also make the data provided more meaningful. What would be the best approach in your opinion?
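As a rough sketch of the filtering I have in mind (the table name `alignments` and the `probability` column are assumptions about the schema; the threshold is just an example):

```python
import sqlite3

# Shrink one published dump in place by dropping low-probability candidates.
con = sqlite3.connect("mlwiki_aligned_sections_2022-02.sqlite")
before = con.execute("SELECT COUNT(*) FROM alignments").fetchone()[0]
con.execute("DELETE FROM alignments WHERE probability <= 0.5")
con.commit()
con.execute("VACUUM")  # rewrite the file so it actually shrinks on disk
after = con.execute("SELECT COUNT(*) FROM alignments").fetchone()[0]
print(before, "->", after, "records")
con.close()
```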

Another consideration is which target languages we need in the database. For example, mlwiki -> iswiki (Malayalam to Icelandic) exists in the database, with 122 items with a probability score above 0.9 out of 23216 total items. The above-mentioned filtering would remove all of the low scores. The chances that somebody translates from Malayalam to Icelandic are very low.

Hi @santhosh, @MunizaA has created a new dump containing only the pairs with probability > 0.9, which you can find here. We think that 0.9 might be too high; let us know if you want to try other values between 0.5 and 0.9.

Thanks. I see that the total database for all pairs is just 485 MB. That is a great improvement. I was using 0.9 only as an example in my previous comment (sorry if I was not clear enough). I think we can include results with slightly lower scores too, as you mentioned, maybe 0.5 to 0.9. I don't think it will cause a much larger database. The probability and rank also need to be retained in the database table.

For cxserver, we have to use this database along with the current database we have. The existing database relies on frequency as the confidence factor, as it comes from the CX corpus. I am considering keeping both databases: first query the corpus-based database, and then the database produced by you.
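Purely as an illustration of that lookup order (the database paths, table, and column names below are placeholders; cxserver's actual implementation will differ):

```python
import sqlite3

# Hypothetical two-step lookup: try the CX-corpus mappings first, then
# fall back to the new alignment database.
def find_target_sections(source_title,
                         db_paths=("cx_corpus.sqlite",
                                   "section_alignments.sqlite")):
    for path in db_paths:
        with sqlite3.connect(path) as con:
            rows = con.execute(
                "SELECT target FROM mappings "
                "WHERE source = ? ORDER BY score DESC LIMIT 5",
                (source_title,),
            ).fetchall()
        if rows:  # stop at the first database with a match
            return [target for (target,) in rows]
    return []
```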

Hi @santhosh, I've restored the probability and rank columns in the database and uploaded the new version here. The directory also contains databases with lower threshold scores (0.5-0.8). Please let me know if you have any questions, thanks.

Thanks. We have created a ticket, T306963: Integrate new section mapping database, to use this database and integrate it with cxserver. We may have further questions as we work on the integration.