Exploration on content propagation across Wikimedia projects.
Closed, Resolved (Public)

Assigned To

Authored By

	diego
	May 16 2020, 12:08 AM

Description

Goal

Understand the cross-pollination of content across Wikimedia projects.

Problem definition

In this research we want to understand how content propagates across the different language editions of Wikipedia. As the unit of study we use Wikidata Items with sitelinks, meaning that we consider the subset of Wikidata Items that have at least one associated article in any Wikimedia project. We start by evaluating, once an item has been created in one language, which language it is most likely to propagate to next. Our hypothesis is that there is a relation between the creation of items in different languages. For example, if an item exists in just one language it might not propagate to more projects, but if the item already exists in 5 languages it is more likely to appear in a new project. Moreover, this probability may also depend on how related the languages (as a proxy for cultures) are.
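To make the unit of study concrete, here is a minimal sketch, assuming each sitelink is available as a hypothetical (item_id, language, creation_time) record. It groups sitelinks into per-item cascades ordered by creation time and estimates, for each cascade size k, the share of items that later reached at least one more language. The record layout and the function names are illustrative assumptions, not the project's actual pipeline.

```python
from collections import defaultdict


def build_cascades(records):
    """Group (item_id, language, creation_time) records into per-item
    language sequences ordered by article creation time."""
    by_item = defaultdict(list)
    for item_id, language, creation_time in records:
        by_item[item_id].append((creation_time, language))
    return {
        item_id: [language for _, language in sorted(pairs)]
        for item_id, pairs in by_item.items()
    }


def propagation_rate_by_size(cascades):
    """For each size k, the fraction of items that existed in k languages
    and later appeared in at least one more."""
    reached = defaultdict(int)
    continued = defaultdict(int)
    for languages in cascades.values():
        for k in range(1, len(languages) + 1):
            reached[k] += 1
            if len(languages) > k:
                continued[k] += 1
    return {k: continued[k] / reached[k] for k in sorted(reached)}


# Toy records (hypothetical item IDs and timestamps).
records = [
    ("Q1", "en", 10), ("Q1", "es", 20), ("Q1", "pt", 30),
    ("Q2", "de", 5),
    ("Q3", "fr", 7), ("Q3", "en", 9),
]
cascades = build_cascades(records)
rates = propagation_rate_by_size(cascades)
```

On real data, a rate that rises with k would be consistent with the hypothesis above. Note that recently created items are censored: they may still propagate after the snapshot was taken.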

Team

This work is done in collaboration with Giovanni Comarela (UFES), @Rvvalentim (Research Intern / UFES) and Souneil Park (Telefónica Barcelona).

Subtasks

Build a dataset of the creation time of all articles in Wikimedia projects, and their corresponding Wikidata Item.
Build a test subset
Define an ML setup for studying the content propagation phenomenon.
Create a model for predicting content propagation across wikis.
Submit a paper.
Disseminate results (paper).

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		diego	T260564 Deliver 4 milestones towards 1 model to understand the diffusion of content on the Wikimedia projects
		Resolved		diego	T252933 Exploration on content propagation across Wikimedia projects.

Event Timeline

diego created this task. May 16 2020, 12:08 AM

Previous work:

Build a dataset of the creation time of all articles in Wikimedia projects, and their corresponding Wikidata Item.
Build a test subset

Weekly Update:

Onboarding @Rvvalentim
Testing an ML approach that frames the task as a "next word prediction" problem, using LSTM networks.
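As a rough illustration of the "next word prediction" framing, each language can be treated as a token and each item's ordered list of languages as a sentence. The sketch below only prepares training pairs and integer encodings that a sequence model such as an LSTM could consume. The helper names, the padding convention (index 0), and the fixed input length are assumptions for illustration; the thread does not describe the actual preprocessing.

```python
def make_next_language_examples(cascade):
    """Turn one ordered language sequence into (prefix, next_language) pairs."""
    return [(tuple(cascade[:i]), cascade[i]) for i in range(1, len(cascade))]


def build_vocabulary(cascades):
    """Map each language code to a positive integer; 0 is reserved for padding."""
    languages = sorted({language for cascade in cascades for language in cascade})
    return {language: index for index, language in enumerate(languages, start=1)}


def encode(prefix, vocabulary, max_len):
    """Encode a prefix as a fixed-length list of ids, left-padded with 0.
    Prefixes longer than max_len keep only their most recent languages."""
    ids = [vocabulary[language] for language in prefix][-max_len:]
    return [0] * (max_len - len(ids)) + ids


# Toy cascades (hypothetical).
cascades = [["en", "es", "pt"], ["fr", "en"]]
vocabulary = build_vocabulary(cascades)
examples = make_next_language_examples(cascades[0])
encoded = [encode(prefix, vocabulary, max_len=4) for prefix, _ in examples]
```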

diego updated the task description. May 23 2020, 12:04 AM

Weekly Update:

Testing a Multivariable LSTM: the implementation uses not just the order in which languages were created for each Wikidata item, but also takes as input the interval between the creation of two consecutive languages.

Moved experiments to the stats machines, in order to work with a larger sample.

Updates

Defined baseline models based on global and local popularity (always predicting the most popular language transition).

First results with the Multivariable LSTM show promising performance.
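One possible reading of the two baselines, sketched under stated assumptions: the "global" baseline always predicts the language that is most often the target of a transition overall, and the "local" baseline predicts the most frequent successor of the last language in the cascade, falling back to the global choice for unseen languages. These definitions and function names are an interpretation, not the project's code.

```python
from collections import Counter, defaultdict


def fit_popularity_baselines(cascades):
    """Count next-language frequencies overall and per preceding language."""
    global_counts = Counter()
    local_counts = defaultdict(Counter)
    for languages in cascades:
        for previous, following in zip(languages, languages[1:]):
            global_counts[following] += 1
            local_counts[previous][following] += 1
    return global_counts, local_counts


def predict_global(global_counts):
    """Always predict the most frequent next language overall."""
    return global_counts.most_common(1)[0][0]


def predict_local(local_counts, global_counts, last_language):
    """Predict the most frequent successor of last_language, falling back
    to the global baseline when that language was never a predecessor."""
    counts = local_counts.get(last_language)
    if not counts:
        return predict_global(global_counts)
    return counts.most_common(1)[0][0]


# Toy cascades (hypothetical).
cascades = [
    ["de", "en", "fr"],
    ["es", "en"],
    ["de", "fr"],
    ["de", "fr", "en"],
    ["pt", "en"],
]
global_counts, local_counts = fit_popularity_baselines(cascades)
global_guess = predict_global(global_counts)
local_guess = predict_local(local_counts, global_counts, "de")
fallback_guess = predict_local(local_counts, global_counts, "ja")
```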

Updates

Adding topic information to the model.

Modifying the temporal representation to an exponential time decay model.
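The thread does not specify the form of the decay. As a sketch under the assumption that each language already in the cascade is weighted by exp(-age / tau), where age is the time elapsed since that language's article was created, the features could be computed as follows. The function name, the time unit (days), and the default tau are all hypothetical.

```python
import math


def time_decay_features(events, now, tau_days=365.0):
    """Weight each (language, created_day) event by exp(-age / tau_days),
    where age = now - created_day. Recent creations get weights near 1,
    old ones decay toward 0."""
    features = {}
    for language, created_day in events:
        age = now - created_day
        if age < 0:
            raise ValueError("event created after the reference time")
        features[language] = math.exp(-age / tau_days)
    return features


# Toy example: "en" was created one year before "es" (hypothetical values).
features = time_decay_features([("en", 0), ("es", 365)], now=365, tau_days=365.0)
```

A smaller tau makes the representation forget older creations faster; tau would be a hyperparameter to tune.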

diego triaged this task as High priority. Jun 13 2020, 12:58 AM

leila moved this task from Backlog to FY2019-20-Research-April-June on the Research board. Jun 19 2020, 11:18 PM

leila edited projects, added Research (FY2019-20-Research-April-June); removed Research.

diego updated the task description. Jun 22 2020, 2:41 PM

Updates

Testing a binary classification model to predict if an item will keep propagating to other wikis, or not.

Experimenting with different time representation models.

Updates

Divided the problem into two steps:
- Will the item propagate (binary classification)?
- Where will the item propagate?
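The two steps can be sketched as a small pipeline: a binary model decides whether a cascade will continue, and only then a second model proposes the next language. The code below shows how training examples for both steps could be derived from one cascade and how the two models would be chained. Both models are passed in as plain callables, and the stand-ins used in the example are placeholders, not the models used in this work.

```python
def make_two_step_examples(cascade):
    """Derive examples for both steps from one ordered language sequence.
    Returns (binary, targets): binary holds (prefix, continues) pairs,
    targets holds (prefix, next_language) pairs for continuing prefixes."""
    binary, targets = [], []
    for i in range(1, len(cascade) + 1):
        prefix = tuple(cascade[:i])
        continues = i < len(cascade)
        binary.append((prefix, continues))
        if continues:
            targets.append((prefix, cascade[i]))
    return binary, targets


def predict_propagation(prefix, will_propagate, next_language):
    """Step 1: ask the binary model. Step 2: only if it says yes,
    ask the next-language model. Returns None when no propagation is predicted."""
    if not will_propagate(prefix):
        return None
    return next_language(prefix)


# Placeholder models for illustration only.
def stub_will_propagate(prefix):
    return len(prefix) >= 2


def stub_next_language(prefix):
    return "en"


binary, targets = make_two_step_examples(["en", "es"])
no_prediction = predict_propagation(("pt",), stub_will_propagate, stub_next_language)
prediction = predict_propagation(("pt", "es"), stub_will_propagate, stub_next_language)
```

The negative label on the final prefix is only as reliable as the observation window: an item that has not propagated yet may still do so later.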

diego moved this task from FY2019-20-Research-April-June to FY2020-21-Research-July-September on the Research board. Jul 13 2020, 4:22 PM

diego edited projects, added Research (FY2020-21-Research-July-September); removed Research (FY2019-20-Research-April-June).

Updates

Qualitative analysis on the model results.

Jamie7687 subscribed. Aug 7 2020, 7:27 PM

Updates

The first round of analysis and models has finished. We are working on the report, which should be online in the coming weeks.

diego mentioned this in T260564: Deliver 4 milestones towards 1 model to understand the diffusion of content on the Wikimedia projects. Aug 17 2020, 3:13 PM

diego added a parent task: T260564: Deliver 4 milestones towards 1 model to understand the diffusion of content on the Wikimedia projects.

Updates

We have published the first round of analysis.
Some important highlights:
- The size of a project (i.e., its number of articles) is not correlated with the likelihood of its content propagating to other projects.
- Initial results show a correlation between cultural similarity and the likelihood of two or more projects sharing similar content.
- For long cascades (i.e., articles that exist in several languages), we are able to predict with reasonable accuracy the new languages that will create articles about the same topic.

Updates

We are currently working on preparing a paper to be submitted at the end of October.

diego updated the task description. Sep 4 2020, 8:48 PM

Updates

We are currently working on the paper, adding new analyses and improvements to the model published in the first round of analysis.

Updates

Paper was submitted last week.

Next steps

I'll keep this task open, and add some new subtasks about disseminating our results within and outside the WMF.

diego updated the task description. Oct 26 2020, 2:51 PM

Updates

We are exploring a follow-up to this project that, based on our results, will focus on how to model the spread of disinformation.

Updates

We started a preliminary analysis of the propagation of sources across wikis.

leila moved this task from FY2020-21-Research-July-September to FY2020-21-Research-October-December on the Research board. Jan 4 2021, 7:31 PM

leila edited projects, added Research (FY2020-21-Research-October-December); removed Research (FY2020-21-Research-July-September).

leila moved this task from FY2020-21-Research-October-December to FY2020-21-Research-January-March on the Research board. Apr 29 2021, 1:00 AM

leila edited projects, added Research (FY2020-21-Research-January-March); removed Research (FY2020-21-Research-October-December).

leila moved this task from FY2020-21-Research-January-March to FY2020-21-Research-April-June on the Research board.

leila edited projects, added Research (FY2020-21-Research-April-June); removed Research (FY2020-21-Research-January-March).

Updates

I gave an online talk presenting this and other disinformation-related work: https://youtu.be/jTF4gdertUA

Updates

We have shared the results in several places, such as the talk mentioned before and the research showcase.
This work was published at ICWSM'21; the paper is publicly available.

diego closed this task as Resolved. May 14 2021, 3:08 PM

diego updated the task description.
