Wed, Aug 14
Tue, Aug 13
Wed, Aug 7
Mon, Aug 5
Thu, Aug 1
Mon, Jul 29
Jul 11 2019
This has been solved here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Results
Check an example here: https://secrec.wmflabs.org/API/alignment/en/ja/Work
Jul 10 2019
@Pginer-WMF , I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?
Considering the feedback obtained in T225136, we conclude that the prioritization should be adapted to the characteristics of the editor being assisted. We can split editors into two disjoint groups, generating two different types of recommendations:
Jul 5 2019
I think this use case highlights the need for a canonical (standardized) cross-lingual topic model that we could all use as the reference for all the projects within the WMF.
Jun 13 2019
Oh! I see, that number is a distance, so 0 would be a perfect match and 1 no match at all. I've already put an upper bound of 0.45, so you will only see values lower than that.
If I understood correctly, you are asking why two identical strings do not have distance = 0; this is because there is no string-matching mechanism in this approach. Every language is trained separately, and then aligned using some words or sentences that we know are equivalent. This is not necessarily bad, because you will find some words that are written exactly the same but mean different things in each language. However, in the examples that you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would only work for languages that share the same script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
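To illustrate the idea, here is a minimal sketch of mixing the two signals. It uses `difflib.SequenceMatcher` as a stand-in for a string-similarity measure (Levenshtein would behave similarly), and a fixed mixing weight `alpha`, which is a hypothetical parameter; with training data it could be learned instead.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    # Ratio in [0, 1]; only meaningful when both words share a script.
    return SequenceMatcher(None, a, b).ratio()

def combined_distance(embedding_distance, word_a, word_b, alpha=0.5):
    # alpha is a hypothetical mixing weight between the embedding
    # distance and the (inverted) string similarity.
    return alpha * embedding_distance + (1 - alpha) * (1 - string_similarity(word_a, word_b))

# Identical surface forms pull the combined distance down, even when
# the aligned embeddings are noisy:
print(combined_distance(0.40, "hotel", "hotel"))  # 0.2
print(combined_distance(0.40, "hotel", "auberge"))
```

As noted above, this only helps for same-script language pairs, and false friends (identical strings with different meanings) would be wrongly rewarded.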
Jun 10 2019
May 30 2019
I have created and uploaded the full experiments and aligned parameters for these languages:
["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "he", "it", "ta", "id", "fa", "ca"]
May 23 2019
Sorry, I put the wrong link to the experiments in the previous comment; it is now updated.
May 20 2019
You can find the results of the experiments here.
May 16 2019
Apr 29 2019
Apr 22 2019
Apr 17 2019
Apr 16 2019
From the [[https://meta.wikimedia.org/wiki/Research_talk:Expanding_Wikipedia_articles_across_languages/Inter_language_approach/Feedback | feedback page ]] that was set up, we got the following two main points:
This has been done and tracked in T215348
Documentation can be found here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation
Fixed JSON format issues for the APIs. Now they are working correctly.
Apr 10 2019
Apr 9 2019
The documentation for the updated version of the section recommender system can be found here:
API up and running, please check the documentation here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation
Mar 25 2019
The solution to this problem was the following:
Finally it's not just me squeezing the notebooks' memory :)
Mar 21 2019
Feb 21 2019
I think we are talking about three different things:
Feb 19 2019
@JAllemandou , yes. Having this by revision would be great!
Feb 12 2019
@Tbayer , great. Thanks.
Feb 11 2019
@jcrespo, the API works well for querying specific pages/entities, but not, for example, for finding which pages that exist on X_wiki are missing on Y_wiki.
My point here is that the Wikidata identifier is currently the main identifier for a page/concept, and that this fact is not reflected in the DB structure. I understand that this might be due to historical reasons, but it would be good to think of a way for our DBs to make it easier to link content across wikis.
@EBernhardson , this looks exactly like what I was looking for initially. Thank you very much for that.
Looks good @JAllemandou, thanks.
This is a good workaround, but imho we should have a structure or schema that makes this kind of task easier, especially for people outside, without access to a cluster.
We do have one very large asset file at 1.9GB (word2vec embedding). I don't need that to be much bigger right now, but we're starting to discuss using embeddings more generally in the mid term and I don't have a good sense for how large they can become. @diego might have a better sense for how big these embeddings can be.
Feb 8 2019
Feb 7 2019
Check this notebook; apparently the number of white spaces is a pretty good indicator of filename quality.
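The heuristic itself is tiny; a sketch of it (the example filenames are made up for illustration):

```python
def whitespace_count(filename):
    # Count whitespace characters in the filename.
    return sum(ch.isspace() for ch in filename)

# Descriptive, human-written names tend to contain several words,
# while camera-style or machine-generated names contain none:
print(whitespace_count("Eiffel Tower at night.jpg"))  # 3
print(whitespace_count("IMG_20190207_1234.jpg"))      # 0
```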
Feb 5 2019
Thanks @JAllemandou !
Jan 15 2019
Hi! I'm not sure what this is, but you can definitely delete diego_tmp.
Nov 28 2018
@bmansurov , eyeballing it, I can say:
Nov 27 2018
Hey @bmansurov , the list in Spanish is over 11K. Maybe you could bin by cosine similarity and create a stratified sample. Doing 11K doesn't sound realistic to me.
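A minimal sketch of what I mean by a stratified sample, assuming each item carries a cosine similarity in [0, 1] (the function name and bin count are just for illustration):

```python
import random

def stratified_sample(items, n_total, n_bins=10, seed=0):
    """items: list of (name, cosine_similarity) pairs, similarity in [0, 1].
    Bins items by similarity and samples proportionally from each bin,
    so the sample covers the whole similarity range."""
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for name, sim in items:
        idx = min(int(sim * n_bins), n_bins - 1)  # clamp sim == 1.0 into last bin
        bins[idx].append(name)
    sample = []
    for b in bins:
        # Proportional allocation: each bin contributes according to its size.
        k = round(n_total * len(b) / len(items))
        sample.extend(rng.sample(b, min(k, len(b))))
    return sample
```

With ~11K items, sampling a few hundred this way should still cover both the very similar and the very dissimilar ends of the list.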
Nov 15 2018
Nov 5 2018
@Krenair , I know that this field is needed in that table on the database located on analytics-store.eqiad.wmnet. I'm not sure what the procedure/dependencies are to do this, sorry.
@Aklapper , I'm not sure which would be the proper tag. I don't see suggestions related to the MariaDB replicas.
Hi @Aklapper ,
I'm referring to the MariaDB tables on analytics-store.eqiad.wmnet. I suppose that this requires a change in the schema.
Oct 18 2018
Oct 15 2018
Oct 4 2018
Oct 1 2018
Sep 26 2018
Kateryna is working on this: https://meta.wikimedia.org/wiki/Research:Matching_Red_Links_with_Wikidata_Items
Sep 25 2018
Makes sense! Thanks!
@bmansurov , interesting. I've tried with 'uz' and also don't see anything repeated. Given that 'uz' is currently a single file, that makes me think it's something related to the parallelization.
I'm cleaning my code, and found that my parser produces duplicated output. Each row is present twice in the output. These two repeated rows are not adjacent, meaning that line 1 is not repeated in line 2, but in some line X with X > 2. Can you please have a look here and try to guess what I am doing wrong?
For sure I can do a post-filter, but I would love to understand what is happening.
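For reference, the post-filter would just be a one-pass dedup that keeps the first occurrence of each row in its original order (this works around the symptom; the real fix would be in the parser itself):

```python
def deduplicate(rows):
    # Keep the first occurrence of each row, drop later repeats,
    # and preserve the original order. Rows are assumed hashable
    # (e.g. tab-separated strings).
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

print(deduplicate(["a", "b", "c", "a", "b"]))  # ['a', 'b', 'c']
```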
Sep 6 2018
Sep 1 2018
Beyond my subjective opinion about these rankings, I'm not sure what I should evaluate here. I understand that the paper already includes an evaluation methodology. Are you trying to measure something new that is not covered by that methodology?