Thu, Jun 13
oh! I see, that number is a distance, so 0 would be a perfect match and 1 not matching at all. I've already put an upper bound of .45, so you will just see values lower than that.
If I understood correctly, you are asking why two identical strings do not have distance = 0; this is because there is no string-matching mechanism in this approach. Every language is trained separately, and then aligned using some words or sentences that we know are equivalent. This is not necessarily bad, because you will find some words that are written exactly the same but mean different things in each language. However, in the examples that you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would work only for languages within the same script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
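Just to make the idea concrete, here is a minimal sketch (not part of the current pipeline) of how the two signals could be blended; the weight `alpha` is a hypothetical hand-set parameter that, with labeled pairs, could be learned instead:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def mixed_distance(s1, s2, emb_dist, alpha=0.5):
    """Blend embedding distance with normalized edit distance.
    Only meaningful when both strings share the same script."""
    norm_edit = levenshtein(s1, s2) / max(len(s1), len(s2), 1)
    return alpha * emb_dist + (1 - alpha) * norm_edit
```

Identical strings would then get norm_edit = 0, pulling the combined distance toward 0 even when the embedding distance is noisy.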
Mon, Jun 10
Thu, May 30
I have created and uploaded the full experiments and aligned parameters for these languages:
["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru", "he", "it", "ta", "id", "fa", "ca"]
May 23 2019
Sorry, I put the wrong link to the experiments in the previous comment; it is now updated.
May 20 2019
You can find the results of the experiments here.
May 16 2019
Apr 29 2019
Apr 22 2019
Apr 17 2019
Apr 16 2019
From the [[https://meta.wikimedia.org/wiki/Research_talk:Expanding_Wikipedia_articles_across_languages/Inter_language_approach/Feedback | feedback page]] that was set up, we got the following two main points:
This has been done and tracked in T215348
Documentation can be found here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation
Fixed JSON format issues for the APIs. Now they are working correctly.
Apr 10 2019
Apr 9 2019
The documentation for the updated version of the section recommender system can be found here:
API up and running, please check the documentation here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation
Mar 25 2019
The solution to this problem was the following:
Finally it's not just me squeezing the notebooks' memory :)
Mar 21 2019
Feb 21 2019
I think we are talking about three different things:
Feb 19 2019
@JAllemandou , yes. Having this by revision would be great!
Feb 12 2019
@Tbayer , great. Thanks.
Feb 11 2019
@jcrespo, the API works well for querying specific pages/entities, but not, for example, for knowing which pages existing in X_wiki are missing in Y_wiki.
My point here is that the Wikidata identifier is currently the main identifier for a page/concept, and that this fact is not reflected in the DB structure. I understand that this might be due to historical reasons, but it would be good to think of a way to make it easier for our DBs to link content across wikis.
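To illustrate the task I mean, a toy sketch (the mappings here are illustrative; in practice they would come from the sitelinks, however we obtain them):

```python
# Map Wikidata IDs to page titles per wiki (illustrative data).
x_wiki = {"Q42": "Douglas Adams", "Q7259": "Ada Lovelace"}
y_wiki = {"Q42": "Douglas Adams"}

# Concepts present in X_wiki but missing in Y_wiki.
missing_in_y = {qid: title for qid, title in x_wiki.items()
                if qid not in y_wiki}
print(missing_in_y)  # {'Q7259': 'Ada Lovelace'}
```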
@EBernhardson , this looks exactly like what I was looking for initially. Thank you very much for that.
Looks good @JAllemandou, thanks.
This is a good workaround, but imho we should have a structure or schema that makes this kind of task easier, especially for people outside, without access to a cluster.
We do have one very large asset file at 1.9GB (word2vec embedding). I don't need that to be much bigger right now, but we're starting to discuss using embeddings more generally in the mid term and I don't have a good sense for how large they can become. @diego might have a better sense for how big these embeddings can be.
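As a rough, back-of-the-envelope bound (the numbers below are illustrative, not a measurement), the size is roughly vocabulary × dimensions × 4 bytes for float32 vectors:

```python
vocab, dims = 2_000_000, 300      # illustrative vocabulary size and dimensions
size_gb = vocab * dims * 4 / 1e9  # float32 = 4 bytes per value
print(f"~{size_gb:.1f} GB")       # ~2.4 GB, same order as the 1.9GB file
```

So growth comes mainly from vocabulary size and dimensionality; multilingual or subword models can push both up considerably.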
Feb 8 2019
Feb 7 2019
Check this notebook; apparently the number of white spaces is a pretty good indicator of filename quality.
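For reference, a minimal sketch of my reading of that heuristic (the example names are made up): descriptive, human-written filenames tend to contain spaces, unlike camera defaults.

```python
def whitespace_score(filename):
    """Count white-space characters as a rough quality signal."""
    return sum(ch.isspace() for ch in filename)

for name in ["IMG_1234.jpg", "Sunset over the Golden Gate Bridge.jpg"]:
    print(name, whitespace_score(name))  # 0 vs. 5
```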
Feb 5 2019
Thanks @JAllemandou !
Jan 15 2019
Hi! I'm not sure what this is, but you can certainly delete diego_tmp.
Nov 28 2018
@bmansurov , eyeballing it, I can say:
Nov 27 2018
Hey @bmansurov , the list in Spanish is over 11K. Maybe you could sample by cosine similarity and create a stratified sample; doing all 11K doesn't sound realistic to me.
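A minimal sketch of what I have in mind (synthetic data, assumed column names): bin the pairs by cosine similarity and draw an equal number from each bin, instead of annotating all 11K.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"pair_id": range(11000),
                   "cosine": rng.uniform(0, 1, 11000)})  # synthetic scores
df["stratum"] = pd.cut(df["cosine"], bins=[0.0, 0.25, 0.5, 0.75, 1.0],
                       include_lowest=True)
sample = (df.groupby("stratum", observed=True, group_keys=False)
            .apply(lambda g: g.sample(n=min(len(g), 50), random_state=0)))
print(sample["stratum"].value_counts())  # ~50 pairs per similarity bin
```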
Nov 15 2018
Nov 5 2018
@Krenair , I know that this field is needed in that table on the database located in analytics-store.eqiad.wmnet. I'm not sure what the procedure/dependencies are to do this, sorry.
@Aklapper , I'm not sure which would be the proper tag. I don't see suggestions related to the MariaDB replicas.
Hi @Aklapper ,
I'm referring to the MariaDB tables on analytics-store.eqiad.wmnet. I suppose that this requires a change in the schema.
Oct 18 2018
Oct 15 2018
Oct 4 2018
Oct 1 2018
Sep 26 2018
Kateryna is working on this: https://meta.wikimedia.org/wiki/Research:Matching_Red_Links_with_Wikidata_Items
Sep 25 2018
Makes sense! Thanks!
@bmansurov , interesting. I've tried with 'uz' and also don't see anything repeated. Given that 'uz' is currently a single file, that makes me think it is something related to the parallelization.
I'm cleaning my code, and found that my parser produces duplicated output. Each row is present twice in the output. These two repeated rows are not together, meaning that line 1 is not repeated in line 2, but in line X with X > 2. Can you please have a look here and try to guess what I am doing wrong?
I can certainly do a post-filter, but I would love to understand what is happening.
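For the record, the post-filter itself is trivial, which is why I would rather understand the root cause; a sketch (the filename is hypothetical):

```python
def dedupe_lines(lines):
    """Drop exact duplicate rows, keeping first-occurrence order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

with open("parser_output.tsv") as f:  # hypothetical output file
    unique_rows = list(dedupe_lines(f))
```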
Sep 6 2018
Sep 1 2018
Beyond my subjective opinion about these rankings, I'm not sure what I should evaluate here. I understand that the paper already includes an evaluation methodology. Are you trying to measure something new that is not covered by that methodology?
Aug 29 2018
Yeah! it works!
Aug 25 2018
Aug 22 2018
I think Apache NiFi is the usual way to move large local directories to HDFS.
Aug 21 2018
Aug 14 2018
Aug 13 2018
Py4JJavaError: An error occurred while calling o60.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 36 in stage 3.0 failed 4 times, most recent failure: Lost task 36.3 in stage 3.0 (TID 4214, analytics1050.eqiad.wmnet, executor 61): ExecutorLostFailure (executor 61 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.1 GB of 2 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'll try with some "tweaks" and let you know.
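By "tweaks" I mean raising the YARN memory overhead that the error message itself suggests; a sketch of the kind of settings I'll try (the app name and values are illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parsing-job")                                # hypothetical name
         .config("spark.yarn.executor.memoryOverhead", "1024")  # MB; raised per the error
         .config("spark.executor.memory", "4g")
         .getOrCreate())
```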
Aug 9 2018
@JAllemandou , I've done a copy/paste of your code in the Notebook and I get the same error as before:
Aug 8 2018
@Ottomata for example this:
Aug 6 2018
With pyspark I'm getting this error a lot (even when I'm working with small datasets, for example a list of 1 million integers):
Jul 9 2018
Regarding the visualization problems (autocomplete, but especially the inline plots), here is a possible way to explore:
After restarting the notebook, I now get this error. I have the same error if I use pyspark from stat1005. The solution that I found for this is to work with Python 2.7 in pyspark.
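One standard way to select that interpreter (an assumption about how it is set up here; the interpreter path may differ per stat machine) is via Spark's environment variables before the session starts:

```python
import os

# Standard Spark environment variables selecting the worker/driver Python.
os.environ["PYSPARK_PYTHON"] = "python2.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python2.7"
```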
Jul 7 2018
I'm OK with changing to another language. I don't have strong preferences between Korean and Vietnamese; I think we should select the one with the higher probability of getting good/fast translations.
Jul 5 2018
@Trizek-WMF , thanks for your efforts.
Can we push more for translators from/to Japanese? This is the main missing piece.
Jul 2 2018
Jun 18 2018
Jun 13 2018
@Trizek-WMF, I agree with you that the instructions could be improved; I myself struggled a bit translating to Spanish. But we have already done many iterations to reach this quality of instructions, and although they are far from perfect, I don't think that we should be blocked on this. So, please, let's just use these instructions.
Jun 12 2018
@Trizek-WMF : I have completed the translation to Spanish.