Wed, Apr 1
Mon, Mar 30
- No updates this week
Wed, Mar 18
Wed, Mar 11
@santhosh that file is 30 MB; where can I get the full 31.7 GB file?
Mon, Mar 9
Mar 2 2020
Feb 27 2020
Hi @santhosh! Yep, this is super useful. I was considering doing something similar, to find some ground truth for my approaches.
Feb 21 2020
Feb 19 2020
Feb 18 2020
I'll need around 3 weeks (approx.) to finish this.
Feb 17 2020
- Reviewing coronavirus-related cases.
Feb 13 2020
@elukey I've deleted 120 GB. Moved back to 580 GB :)
Hey @elukey, for this task I need to download at least 50 language models, each around 8 GB, so I'll use around 400 GB in total. I'll do my best to make this work with the data on HDFS, but to start I need to have it on a local machine. I'm currently using stat1007 for my experiments. Is it ok if I store the models there temporarily?
I wouldn't say "employed team only": I'll share all the code I'm creating for this, but currently I can't think of tasks where I'd need help with it. Please feel free to contribute to the repo previously mentioned, and tell me if you need my help.
Feb 11 2020
For the record, given that in geoeditors_edits_monthly we store country information using ISO 3166-1 alpha-2 codes, we are losing Bonaire and Kosovo. The former has no code at all, and the latter has only an alpha-3 code.
We might want to consider using the full country name to avoid this kind of problem in the future.
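A minimal sketch of the issue and of the full-name fallback. The `ALPHA2` table, `to_alpha2`, and `store_geoeditors_row` are hypothetical names for illustration, not the actual pipeline code, and the mapping is deliberately incomplete:

```python
# Illustrative only: storing countries by ISO 3166-1 alpha-2 code
# silently drops entities that have no alpha-2 assignment.
ALPHA2 = {
    "France": "FR",
    "Serbia": "RS",
    # Kosovo is absent: it has no official ISO 3166-1 alpha-2 code
    # (only the user-assigned "XK" is sometimes used).
}

def to_alpha2(country_name):
    """Return the alpha-2 code, or None when the country has no code."""
    return ALPHA2.get(country_name)

def store_geoeditors_row(country_name, edits):
    """Keep the row by falling back to the full country name
    instead of losing countries without an alpha-2 code."""
    code = to_alpha2(country_name)
    if code is None:
        return {"country": country_name, "edits": edits}
    return {"country": code, "edits": edits}
```

With this fallback, a row for Kosovo is stored under the full name rather than being dropped.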
Update from last two weeks:
Feb 10 2020
Update from last two weeks:
Feb 3 2020
Jan 31 2020
Btw, is there a way to mount the Wikipedia dumps on those machines?
Jan 30 2020
Jan 22 2020
In the short term, the solution is to use the code as it was designed: to work with Spark.
Jan 21 2020
Hey @Ottomata @JAllemandou, can you please check why the PySpark kernels are not working? I've been trying for a week with the different PySpark kernels on the notebook machines, but the notebook freezes on any command (even if you try non-Spark commands); pure Python is working ok. Thx
Weekly update: Gathered internship proposals within the team, and shared the requirements with @leila
Weekly Update: Preparing the dataset.
Jan 10 2020
Dec 11 2019
Have you already had a look at our Section Recommendation demo app? It's currently working for 6 languages. Expanding it, and especially maintaining it for many languages, could be complex; however, a simplified version of that system, using a dump-based approach instead of an API (like we did with the template parameter alignment), could be feasible.
Oct 25 2019
Oct 10 2019
Here is a summary of the work done: https://meta.wikimedia.org/wiki/Research:Disinformation_Literature_Review
Oct 8 2019
@Nuria the entropy approach looks very cool, thanks for sharing.
We (Research) will be supporting @ssingh on his work related to this problem, especially focused on censorship.
Sep 27 2019
Sep 20 2019
Sep 16 2019
@santhosh, I've never tried that (I understand that Dockerfiles are a kind of virtual environment, but honestly, I've never used them). We can try, but remember that the person will need access to our Spark cluster. Do you know if a Docker environment can connect to YARN?
Aug 28 2019
@diego please review below as well since you worked with GII folks during the past iteration:
Looks ok to me.
Aug 14 2019
Aug 13 2019
Aug 7 2019
Aug 5 2019
Aug 1 2019
Jul 29 2019
Jul 11 2019
This has been solved here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Results
Check an example here: https://secrec.wmflabs.org/API/alignment/en/ja/Work
Jul 10 2019
@Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?
Considering the feedback obtained in T225136, we conclude that the prioritization should be adapted to the characteristics of the editor being assisted. We can split editors into two disjoint groups, generating two different types of recommendations:
Jul 5 2019
I think this use case highlights the need for a canonical (standardized) cross-lingual topic model that we could all use as the reference for all the projects within the WMF.
Jun 13 2019
Oh! I see, that number is a distance, so 0 would be a perfect match and 1 no match at all. I've already put an upper bound of 0.45, so you will just see values lower than that.
If I understood correctly, you are asking why two identical strings don't have distance = 0; this is because there is no string-matching mechanism in this approach. Each language is trained separately, and then aligned using some words or sentences that we know are equivalent. This is not necessarily bad, because you will find some words that are written exactly the same but mean different things in each language. However, in the examples that you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would work only for languages that share the same script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
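The second step above could be sketched roughly like this. The function names and the fixed 50/50 `weight` are illustrative assumptions (as noted, the mixing weight would ideally be learned from training data), not the actual implementation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def string_distance(a, b):
    """Normalized edit distance in [0, 1]; 0 means identical strings."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def mixed_distance(embedding_dist, a, b, weight=0.5):
    """Linear mix of the embedding-alignment distance with a
    surface-string distance. Only meaningful when both languages
    use the same script; `weight` is a placeholder to be tuned."""
    return weight * embedding_dist + (1 - weight) * string_distance(a, b)
```

For identical strings, `string_distance` is 0, so the mix pulls the final distance toward 0 even when the aligned embeddings disagree, which is exactly the noise case discussed above.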
Jun 10 2019
May 30 2019
I have created and uploaded the full experiments and aligned parameters for these languages:
["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru", "he", "it", "ta", "id", "fa", "ca"]
May 23 2019
Sorry, I put the wrong link to the experiments in the previous comment; it's now updated.
May 20 2019
You can find the results of the experiments here.