Thu, Oct 10
Here is a summary of the work done: https://meta.wikimedia.org/wiki/Research:Disinformation_Literature_Review
Tue, Oct 8
@Nuria the entropy approach looks very cool, thanks for sharing.
We (Research) will be supporting @ssingh in his work on this problem, with a particular focus on censorship.
Fri, Sep 27
Fri, Sep 20
Sep 16 2019
@santhosh , I've never tried that (I understand that Docker containers are a kind of virtual environment, but honestly, I've never used them). We can try, but remember that the person will need access to our Spark cluster. Do you know if the Docker environment can connect to YARN?
Aug 28 2019
@diego please review below as well since you worked with GII folks during the past iteration:
Looks ok to me.
Aug 14 2019
Aug 13 2019
Aug 7 2019
Aug 5 2019
Aug 1 2019
Jul 29 2019
Jul 11 2019
This has been solved here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Results
Check an example here: https://secrec.wmflabs.org/API/alignment/en/ja/Work
Jul 10 2019
@Pginer-WMF , I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?
Considering the feedback obtained in T225136, we conclude that the prioritization should be adapted to the characteristics of the editor being assisted. We can split editors into two disjoint groups, generating two different types of recommendations:
Jul 5 2019
I think this use case highlights the need for a canonical (standardized) cross-lingual topic model that we could all use as the reference for all projects within the WMF.
Jun 13 2019
Oh! I see, that number is a distance, so 0 would be a perfect match and 1 means no match at all. I've already set an upper bound of 0.45, so you will only see values lower than that.
If I understood correctly, you are asking why two identical strings do not have distance = 0. This is because there is no string-matching mechanism in this approach. Every language is trained separately, and the models are then aligned using some words or sentences that we know are equivalent. This is not necessarily bad, because you will find some words that are written exactly the same but mean different things in each language. However, in the examples that you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would work only for languages that share the same script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
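A minimal sketch of what such a second step could look like, assuming a plain weighted average between the embedding distance and a length-normalized Levenshtein distance. The function names and the `weight` parameter are hypothetical and not part of the existing system; in practice the weight would have to be learned from training data, as noted above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1]; 0 means identical strings."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mixed_distance(embedding_distance: float, a: str, b: str,
                   weight: float = 0.5) -> float:
    """Hypothetical mix: weight * embedding distance + (1 - weight) * string distance."""
    return weight * embedding_distance + (1 - weight) * normalized_levenshtein(a, b)
```

With this mix, two identically written section titles with an embedding distance of 0.3 would end up at 0.15 for `weight=0.5`, which pulls exact string matches closer without forcing them to 0.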
Jun 10 2019
May 30 2019
I have created and uploaded the full experiments and aligned parameters for these languages:
["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru", "he", "it", "ta", "id", "fa", "ca"]
May 23 2019
Sorry, I put the wrong link to the experiments in the previous comment; it is now updated.
May 20 2019
You can find the results of the experiments here.
May 16 2019
Apr 29 2019
Apr 22 2019
Apr 17 2019
Apr 16 2019
From the [[https://meta.wikimedia.org/wiki/Research_talk:Expanding_Wikipedia_articles_across_languages/Inter_language_approach/Feedback | feedback page ]] we got the following two main points:
This has been done and tracked in T215348
Documentation can be found here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation
Fixed JSON format issues for the APIs. Now they are working correctly.
Apr 10 2019
Apr 9 2019
The documentation for the updated version of the section recommender system can be found here:
API up and running, please check the documentation here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation
Mar 25 2019
The solution to this problem was the following:
Finally it's not just me squeezing notebooks memory :)
Mar 21 2019
Feb 21 2019
I think we are talking about three different things:
Feb 19 2019
@JAllemandou , yes. Having this by revision would be great!
Feb 12 2019
@Tbayer , great. Thanks.
Feb 11 2019
@jcrespo, the API works well for querying specific pages/entities, but not, for example, for finding out which pages that exist in X_wiki are missing from Y_wiki.
My point here is that the Wikidata identifier is currently the main identifier for a page/concept, and that this fact is not reflected in the DB structure. I understand that this might be due to historical reasons, but it would be good to think of a way for our DBs to make it easier to link content across wikis.
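To illustrate the kind of query meant here, a small sketch: if sitelinks were available keyed by Wikidata item, "pages in X_wiki missing from Y_wiki" becomes a simple filter. The `sitelinks` mapping, the item IDs, and the titles below are made-up placeholders, not the actual DB schema.

```python
def missing_pages(sitelinks, source_wiki, target_wiki):
    """Return {item_id: source_title} for items that have a page on
    source_wiki but no page on target_wiki."""
    return {
        item_id: links[source_wiki]
        for item_id, links in sitelinks.items()
        if source_wiki in links and target_wiki not in links
    }


# Toy data: item IDs and titles are placeholders for illustration only.
sitelinks = {
    "Q_A": {"enwiki": "Page A", "eswiki": "Página A"},
    "Q_B": {"enwiki": "Page B"},
    "Q_C": {"eswiki": "Página C"},
}

only_in_english = missing_pages(sitelinks, "enwiki", "eswiki")
print(only_in_english)
```

The point of the sketch is only that the item identifier acts as the join key; with the current per-wiki tables the same question requires combining several sources.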
@EBernhardson , this looks like exactly what I was looking for initially. Thank you very much for that.
Looks good @JAllemandou, thanks.
This is a good workaround, but imho we should have a structure or schema that makes this kind of task easier, especially for outside people without access to a cluster.
We do have one very large asset file at 1.9GB (word2vec embedding). I don't need that to be much bigger right now, but we're starting to discuss using embeddings more generally in the mid term and I don't have a good sense for how large they can become. @diego might have a better sense for how big these embeddings can be.
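As a rough way to reason about how large such embeddings can get, a back-of-the-envelope sketch: a dense embedding table needs roughly vocabulary size × dimensions × bytes per value. The vocabulary sizes and dimensions below are illustrative assumptions, not the parameters of the 1.9GB asset mentioned above, and real files add some overhead for the vocabulary strings.

```python
def embedding_size_bytes(vocab_size: int, dimensions: int,
                         bytes_per_value: int = 4) -> int:
    """Approximate size of a dense embedding matrix (float32 by default)."""
    return vocab_size * dimensions * bytes_per_value


# Hypothetical configurations, for illustration only.
for vocab_size, dimensions in [(1_000_000, 300), (5_000_000, 300)]:
    size_gb = embedding_size_bytes(vocab_size, dimensions) / 1e9
    print(f"{vocab_size:>9} words x {dimensions} dims = {size_gb:.1f} GB")
```

Size grows linearly in both vocabulary and dimensions, so multilingual models with one table per language multiply this by the number of languages.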
Feb 8 2019
Feb 7 2019
Check this notebook: apparently, the number of whitespace characters is a pretty good indicator of filename quality.
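A minimal sketch of that feature, assuming it is computed on the filename without its extension. The function name and the example filenames are hypothetical; the notebook may compute it differently.

```python
def whitespace_count(filename: str) -> int:
    """Count whitespace characters in a filename, ignoring the extension."""
    stem = filename.rsplit(".", 1)[0]
    return sum(1 for char in stem if char.isspace())


# Made-up examples: a camera-style name versus a descriptive one.
for name in ["IMG0042.jpg", "Sunset over the Golden Gate Bridge.jpg"]:
    print(name, whitespace_count(name))
```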
Feb 5 2019
Thanks @JAllemandou !
Jan 15 2019
Hi! I'm not sure what this is, but you can definitely delete diego_tmp.
Nov 28 2018
@bmansurov , from eyeballing it, I can say:
Nov 27 2018
Hey @bmansurov , the list in Spanish has over 11K items. Maybe you could group the items by cosine similarity and create a stratified sample. Reviewing all 11K does not sound realistic to me.
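One way the stratified sample could be drawn, as a sketch: bucket the items into equal-width cosine-similarity bins and sample the same number from each bin. The function name, the number of bins, and the per-bin sample size are assumptions, and similarities are assumed to lie in [0, 1].

```python
import random
from collections import defaultdict


def stratified_sample(pairs, n_bins=5, per_bin=2, seed=0):
    """pairs: iterable of (item, cosine_similarity).
    Returns up to per_bin randomly chosen pairs from each similarity bin."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    bins = defaultdict(list)
    for item, similarity in pairs:
        # Clamp so that similarity == 1.0 (or slightly out of range) stays in a valid bin.
        index = max(0, min(int(similarity * n_bins), n_bins - 1))
        bins[index].append((item, similarity))
    sample = []
    for index in sorted(bins):
        members = bins[index]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


# Toy input: 100 placeholder items with evenly spread similarities.
toy_pairs = [(f"item_{i}", i / 100) for i in range(100)]
sample = stratified_sample(toy_pairs, n_bins=5, per_bin=2)
print(len(sample))
```

This keeps the manual review small while still covering both the high-similarity and low-similarity ends of the list.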