
MunizaA (Muniza)
User

Projects

User does not belong to any projects.

User Details

User Since
Mar 31 2021, 5:59 AM (71 w, 6 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
MunizaA [ Global Accounts ]

Recent Activity

Thu, Aug 4

MunizaA created P32283 mnz SSH public key for WMF production.
Thu, Aug 4, 1:35 PM

Apr 26 2022

MunizaA added a comment to T293511: Expand section aligment to more languages, and share dumps.

Hi @santhosh, I've restored the probability and rank columns for the database and uploaded the new version here. The directory also contains databases with lower threshold scores (0.5 - 0.8). Please let me know if you have any questions, thanks.

Apr 26 2022, 5:07 PM · SectionTranslation, Language-Team (Language-2022-April-June), Research (FY2021-22-Research-April-June)

Apr 3 2022

MunizaA added a comment to T293511: Expand section aligment to more languages, and share dumps.

The following results cover the top 100 language pairs by number of section pairs tested. Precision here is the probability that, among all the target sections aligned to a source section in our extracted data, the translation from the cx dataset was in the top 5. Please note that a source section occurring more than once per (source language, target language) in the cx dataset was counted as a single pair, and it was tested by checking whether any of its corresponding targets ended up in the top 5.
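The counting rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `top5_hit_rate`, the tuple layout of the cx records, and the shape of the `alignments` mapping are all hypothetical, and a source section with no extracted alignments is assumed to count as a miss.

```python
from collections import defaultdict

def top5_hit_rate(cx_pairs, alignments, k=5):
    """Share of cx source sections whose translation is in the top k.

    cx_pairs: iterable of (src_lang, tgt_lang, src_section, tgt_section).
    alignments: dict mapping (src_lang, tgt_lang, src_section) to a list of
    target sections ordered from most to least similar (assumed layout).
    """
    # Collapse repeated source sections into one entry holding all their targets.
    expected = defaultdict(set)
    for src_lang, tgt_lang, src, tgt in cx_pairs:
        expected[(src_lang, tgt_lang, src)].add(tgt)

    per_pair = defaultdict(lambda: [0, 0])  # (src_lang, tgt_lang) -> [hits, tested]
    for key, targets in expected.items():
        top = set(alignments.get(key, [])[:k])
        counts = per_pair[key[:2]]
        counts[0] += bool(targets & top)  # hit if any known target is in the top k
        counts[1] += 1
    return {pair: hits / tested for pair, (hits, tested) in per_pair.items()}

# Made-up records: "History" appears twice in cx but is tested once.
cx = [
    ("en", "es", "History", "Historia"),
    ("en", "es", "History", "Antecedentes"),
    ("en", "es", "Career", "Carrera"),
]
alignments = {
    ("en", "es", "History"): ["Antecedentes", "Origen"],
    ("en", "es", "Career"): ["Vida", "Obra"],
}
rates = top5_hit_rate(cx, alignments)
```

In this toy data the duplicated "History" section counts as a single tested pair and is a hit because one of its two known targets is in the top 5, while "Career" is a miss.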

Apr 3 2022, 5:15 PM · SectionTranslation, Language-Team (Language-2022-April-June), Research (FY2021-22-Research-April-June)

Jan 12 2022

MunizaA added a comment to T293511: Expand section aligment to more languages, and share dumps.

To assess the accuracy of our current language model, we tried to replicate the experiment that @diego had run with the FastText embeddings. This involved training a classifier on a portion of the ground truth and then using it to predict the similarity of the remaining section pairs in the ground truth. More specifically, we took our previously generated set of all possible section pairs for the 6 languages used in this experiment and, for each pair, extracted several features that describe it (the number of times its two sections occur together, how similar the links they contain are on average, etc.). These features are a subset of the ones that Diego used. We then labelled all pairs found in the ground truth as 'True' and the rest as 'False'. A gradient boosting classifier was trained on a portion of this data and then used to classify the rest. Finally, we dense-ranked the classifier's results to evaluate the probability of a pair from the ground truth ending up in the top 5 (precision @ 5).
The results from this experiment were comparable to the previously documented ones. This means that we can replace FastText with a multilingual model and expect similar results. FastText is monolingual: similar words within a language share similar vectors, but translations of a word in different languages do not. A multilingual model therefore eliminates the need to align the vectors of two languages in a single vector space before they can be compared.
The following image depicts the results from the experiment mentioned above. Empty boxes in the chart represent cases where we didn't have enough ground truth.

precision.png (432×720 px, 15 KB)
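As a rough illustration of the dense ranking and precision @ 5 step described in this comment, here is a minimal sketch. It assumes the classifier outputs one probability per (source, target) pair and that ranking is done separately for each source section; the names `dense_rank` and `precision_at_k` and the toy data are hypothetical and are not taken from the actual experiment.

```python
from collections import defaultdict

def dense_rank(scores):
    """Dense rank of each score: 1 is the highest, ties share a rank, no gaps."""
    distinct = sorted(set(scores), reverse=True)
    position = {score: i + 1 for i, score in enumerate(distinct)}
    return [position[s] for s in scores]

def precision_at_k(predictions, ground_truth, k=5):
    """Fraction of ground-truth pairs whose dense rank is at most k.

    predictions: iterable of (source, target, probability).
    ground_truth: set of (source, target) pairs known to be aligned.
    """
    by_source = defaultdict(list)
    for source, target, prob in predictions:
        by_source[source].append((target, prob))

    hits = total = 0
    for source, candidates in by_source.items():
        ranks = dense_rank([prob for _, prob in candidates])
        for (target, _), rank in zip(candidates, ranks):
            if (source, target) in ground_truth:
                total += 1
                hits += rank <= k
    return hits / total if total else 0.0

# Made-up classifier output; k=2 keeps the toy example small.
preds = [
    ("History", "Historia", 0.9),
    ("History", "Geografía", 0.9),
    ("History", "Referencias", 0.2),
    ("References", "Historia", 0.8),
    ("References", "Geografía", 0.6),
    ("References", "Referencias", 0.3),
]
truth = {("History", "Historia"), ("References", "Referencias")}
score = precision_at_k(preds, truth, k=2)
```

With dense ranking, the two candidates tied at 0.9 both receive rank 1, so a tie never pushes the true pair further down the list.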

Jan 12 2022, 9:51 PM · SectionTranslation, Language-Team (Language-2022-April-June), Research (FY2021-22-Research-April-June)

Dec 6 2021

MunizaA added a comment to T293511: Expand section aligment to more languages, and share dumps.

We experimented with multiple pre-trained models from sentence-transformers to find a multilingual model that can accurately and efficiently encode section headings. We found that paraphrase-xlm-r-multilingual-v1 provides the most accurate and consistent results across multiple languages for our use case. It maps sentences to a 768-dimensional shared vector space, and the resulting vectors can then be used to calculate the cosine similarity between co-occurring sections.
The following results were obtained by running the same model evaluation experiment for different language pairs. We evaluate a model by first aligning articles in two languages using their Wikidata ID. We then take the sections from those aligned articles and generate all possible section pairs. The selected model is used to encode these pairs and calculate their similarity. For each section, we then rank its pairs by similarity and check the rank of the true section translation (the one that's in our dataset). Note that these results only include language pairs for which we had more than 20 records in our dataset.
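The evaluation loop described above (generate all section pairs, score them by cosine similarity, then look up the rank of the true translation) can be sketched as below. This is an illustration only: the 3-dimensional vectors are made-up stand-ins for the 768-dimensional embeddings a model would produce, and the helper names `cosine` and `rank_of_true_translation` are hypothetical.

```python
import math
from itertools import product

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_of_true_translation(source_vecs, target_vecs, truth):
    """Return the 1-based rank of the known translation for each source heading.

    source_vecs / target_vecs: dicts mapping a heading to its vector.
    truth: dict mapping a source heading to its known target heading.
    """
    # Score every (source, target) combination from the aligned articles.
    scores = {}
    for (s, sv), (t, tv) in product(source_vecs.items(), target_vecs.items()):
        scores.setdefault(s, []).append((cosine(sv, tv), t))

    ranks = {}
    for s, candidates in scores.items():
        ordered = [t for _, t in sorted(candidates, key=lambda c: c[0], reverse=True)]
        if s in truth:
            ranks[s] = ordered.index(truth[s]) + 1
    return ranks

# Toy vectors standing in for real sentence embeddings.
en = {"History": [0.9, 0.1, 0.0], "References": [0.0, 0.2, 0.9]}
es = {
    "Historia": [0.8, 0.2, 0.1],
    "Referencias": [0.1, 0.1, 0.9],
    "Geografía": [0.3, 0.9, 0.1],
}
truth = {"History": "Historia", "References": "Referencias"}
ranks = rank_of_true_translation(en, es, truth)
```

In a real run the two dictionaries would be filled with the vectors the chosen model produces for each heading; the ranking logic itself does not depend on where the vectors come from.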

Dec 6 2021, 11:49 AM · SectionTranslation, Language-Team (Language-2022-April-June), Research (FY2021-22-Research-April-June)

Oct 13 2021

MunizaA added a comment to T292955: Requesting access to Analytic Cluster for Muniza.

@MunizaA can you confirm that this wikitech user is you? https://ldap.toolforge.org/user/mnz

Also would you rather have mnza0001@gmail.com (from that wikitech account) or munaslam001@gmail.com (from this ticket) associated with this shell account?

Oct 13 2021, 3:38 PM · SRE, SRE-Access-Requests
MunizaA added a comment to T292955: Requesting access to Analytic Cluster for Muniza.

@CDanis I've signed it now. Thanks!

Oct 13 2021, 9:13 AM · SRE, SRE-Access-Requests

Oct 12 2021

MunizaA updated the task description for T292955: Requesting access to Analytic Cluster for Muniza.
Oct 12 2021, 8:25 AM · SRE, SRE-Access-Requests
MunizaA created P17452 MunizaA SSH public key for WMF production.
Oct 12 2021, 8:21 AM

Mar 31 2021

MunizaA updated MunizaA.
Mar 31 2021, 6:12 AM