Mon, Nov 16
- We have selected one Outreachy intern, who will start in December. The intern will help with developing the machine-readable dataset.
- The datasets are ready. We are waiting for the paper to be published before sharing the link publicly. For now, the datasets are available on request via email.
- We started a preliminary analysis of the propagation of sources across wikis.
Nov 2 2020
- We are extending the list to other languages: es, pt, ca.
- Reviewing Outreachy applications from candidates who will help create the machine-readable dataset.
- We are exploring a follow-up to this project that, based on our results, will focus on modeling the spread of disinformation.
Oct 29 2020
For more details on the timeline recommendations please check Isaac's comment here: T263874#6589856
Got it. Yes, looks good; please add it to the Outreachy application.
@KemmieKemy thanks for submitting. You are making great progress.
Oct 28 2020
@Rvvalentim, could you please double-check whether you need any of those files?
Oct 26 2020
- Paper was submitted last week.
Oct 10 2020
Oct 8 2020
Oct 2 2020
- We are currently working on the paper, adding new analyses and improvements to the model published in the first round of analysis.
- The data can be found on HDFS: /user/dsaez/topicsForAllWikipediaPages2020-08-24AllProps.csv, or
- You can also download the dataset from: https://analytics.wikimedia.org/published/datasets/topics/
- The dataset follows the same format described here: https://figshare.com/articles/Topics_for_each_Wikipedia_Article_across_Languages/12127434
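To illustrate working with a file in this format, here is a minimal sketch using pandas. The column names (`wiki_db`, `page_id`, `topic`) and values are assumptions for illustration only; check the figshare description above for the actual schema.

```python
import pandas as pd

# Hypothetical sketch: the column names and rows below are made up for
# illustration -- see the figshare page for the real format.
df = pd.DataFrame({
    "wiki_db": ["enwiki", "eswiki", "enwiki"],
    "page_id": [11, 22, 33],
    "topic": ["STEM.Physics", "Culture.Sports", "STEM.Physics"],
})

# Count how many pages fall under each topic within each wiki.
counts = df.groupby(["wiki_db", "topic"]).size().reset_index(name="n_pages")
```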
Sep 30 2020
Sep 29 2020
Hi @RBrounley_WMF, thanks for sharing this and for the great work you are doing. A few comments from my side:
Sep 24 2020
@leila I see some overlap, although this task seems to be broader than the one I'm working on. Given that I don't see much documentation or code for this task, I prefer not to take responsibility for it.
Sep 4 2020
- We are currently working on preparing a paper to be submitted at the end of October.
- The two datasets have been prepared:
- One dataset with items that propagate across Wikipedias, with bot activity removed.
- Another dataset about external references (links) across projects.
- A recent dump (with all articles existing as of Aug 31st) has been created. In the coming days I will upload it to a public repository.
Aug 31 2020
- We have published the first round of analysis.
- Some important highlights:
- The size of a project (i.e., its number of articles) is not correlated with the likelihood of propagating content to other projects.
- Initial results show a correlation between cultural similarity and the likelihood that two or more projects share similar content.
- For long cascades (i.e., articles that exist in several languages), we are able to predict with reasonable accuracy the new languages that will create articles about the same topic.
Aug 17 2020
- We have finished developing the first two models, one based on a DNN and another based on BoW. We are working on the report, which should be online (on Meta) in the following weeks.
- We are also analyzing the differing coverage of entities across Wikipedia language editions.
- Based on the two developed models, we are comparing against related work and discussing a new model or a fusion of the existing ones.
- The first round of analysis and models is finished. We are working on the report, which should be online in the following weeks.
Jul 28 2020
@GoranSMilovanovic looks great.
Jul 27 2020
It is the sum of all sitelinks divided by the total number of articles. Do you think the formulation should be changed? Thanks.
No, I think it is good like this.
What would be great is to - additionally - have the statistic only for Wikipedias (or by project type, e.g., wikipedia, wiktionary, etc.).
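As a toy sketch of the statistic under discussion (sum of all sitelinks divided by the total number of articles, plus the same statistic broken down by project type): all numbers below are made up for illustration, not real Wikidata figures.

```python
# Made-up article records; "sitelinks" is the sitelink count per article.
articles = [
    {"project": "enwiki",       "type": "wikipedia",  "sitelinks": 40},
    {"project": "eswiki",       "type": "wikipedia",  "sitelinks": 25},
    {"project": "enwiktionary", "type": "wiktionary", "sitelinks": 3},
]

# Overall statistic: sum of all sitelinks / total number of articles.
overall = sum(a["sitelinks"] for a in articles) / len(articles)

# Same statistic per project type, as suggested above.
by_type = {}
for a in articles:
    by_type.setdefault(a["type"], []).append(a["sitelinks"])
per_type = {t: sum(links) / len(links) for t, links in by_type.items()}
```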
Hi @GoranSMilovanovic ,
First of all, thanks for the work. Looks really good. My only doubt is about this sentence:
Jul 24 2020
- Analyzing which properties in Wikidata receive the most changes.
- Qualitative analysis of the model results.
Jul 22 2020
Jul 20 2020
@leila: this task was delayed due to COVID-19-related problems. That is why it is still in the FY20 April-June lane; we are working to finish it by the end of August. I'm OK with moving it to FY21 if that works for you.
Jul 14 2020
@elukey yes! This is working, thank you very much! It would be great if you could apply the same patch on stat1005.
Jul 13 2020
And I confirm that this error is not happening on stat1007.
Thanks @JAllemandou I'll try that.
Jul 10 2020
@elukey does the pyspark --master yarn solution mean running another notebook (on another port)?
@elukey apparently no change (I've restarted the server); I'm still getting this error:
Jul 9 2020
Jul 6 2020
- Defined a new supervised approach, using a novel negative-sampling technique.
We have built collaborations with 4 researchers. All the paperwork (MOUs) and announcements have been done. You can see more details here: https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
Jul 2 2020
Additional information: this is the Python version (in the driver, I guess):
Jun 29 2020
- Divided the problem into two steps:
- Will the item propagate (binary classification)?
- Where will the item propagate?
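The two-step split above can be sketched as a pair of classifiers: step 1 predicts whether an item propagates at all, and step 2 predicts the destination, trained only on items that did propagate. This is a hedged illustration, not the project's actual code; the features, labels, and model choice (scikit-learn logistic regression) below are toy placeholders.

```python
from sklearn.linear_model import LogisticRegression

X = [[0, 1], [1, 0], [1, 1], [0, 0]]   # toy item features (placeholders)
y_propagates = [1, 0, 1, 0]            # step 1: binary "will it propagate?"
y_where = ["eswiki", "ptwiki"]         # step 2: destinations, positives only

step1 = LogisticRegression().fit(X, y_propagates)
X_pos = [x for x, y in zip(X, y_propagates) if y == 1]
step2 = LogisticRegression().fit(X_pos, y_where)

def predict(x):
    """Return a predicted destination wiki, or None if no propagation."""
    if step1.predict([x])[0] == 0:
        return None
    return step2.predict([x])[0]
```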
- Working on representing each Wikipedia page as a set of Wikidata items.
- Collaborations with Souneil and Giovanni have been announced on @wiki-research-l.
Jun 22 2020
- No updates this week
- Defined a subset of pages for manual evaluation.
- Testing a binary classification model to predict whether an item will keep propagating to other wikis.
Jun 13 2020
External researchers have solved this problem: https://zenodo.org/record/2539424
- Giovanni and Souneil have been added to the official list of formal collaborators.
- Started to design a crowdsourced task to label data.
Jun 10 2020
Jun 9 2020
Thanks @ashley and @JAllemandou. I've solved the problem using wmf.mediawiki_history, so no emergencies on my side. However, is that the expected behavior for wmf_raw.mediawiki_revision? And if it is, for the record, could you please provide an example of the join you are suggesting? I don't see an actor table in wmf_raw:
Jun 6 2020
- All legal procedures have been finalized.
- Improving the seq2seq model.
- Adding topic information to the model.
Jun 5 2020
Yes @Nuria, it is an internship. End date: August 10th.
great! gpu's for everybody, power to the people :) !!!
These are the groups we need for @YiJuLu:
Jun 3 2020
@Dzahn & @Aklapper here you have @YiJuLu 's key: https://office.wikimedia.org/wiki/User:Diego_(WMF)/internKeys
Jun 1 2020
- Defined baseline models based on global and local popularity (always predicting the most popular language transition).
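The baselines above can be sketched as follows: the "global" baseline always predicts the single most popular target language overall, while the "local" baseline predicts the most popular target conditioned on the source language. The transition data below is made up for illustration.

```python
from collections import Counter

# Toy (source_lang, target_lang) transitions observed in training data.
transitions = [("en", "es"), ("en", "es"), ("en", "fr"),
               ("de", "en"), ("de", "en"), ("fr", "en")]

# Global baseline: the single most common target overall.
global_top = Counter(t for _, t in transitions).most_common(1)[0][0]

# Local baseline: most common target per source language.
local = {}
for src, tgt in transitions:
    local.setdefault(src, Counter())[tgt] += 1

def predict_global(source_lang):
    return global_top                        # ignores the source entirely

def predict_local(source_lang):
    return local[source_lang].most_common(1)[0][0]
```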