Exploration on content propagation across Wikimedia projects.
Closed, Resolved, Public

Description

Goal

Understand the cross-pollination of content across Wikimedia projects.

Problem definition

In this research we want to understand how content propagates across the different language editions of Wikipedia. As the unit of study we use Wikidata Items with sitelinks, meaning that we consider the subset of Wikidata Items that have at least one article associated with any Wikimedia project. We start by evaluating, once an item is created in one language, which language it is most likely to propagate to next. Our hypothesis is that there is a relation between the creation of items in different languages. For example, if an item exists in just one language it might not propagate to more projects, but if the item already exists in 5 languages it is more likely to appear in a new project. Moreover, this probability may also depend on how related the languages (as a proxy for cultures) are.
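As a rough illustration of the unit of study, the sketch below groups article-creation records into one ordered sequence of wikis per Wikidata Item. The record layout `(item_id, wiki, created_at)`, the function name `build_cascades`, and the toy data are assumptions made for this example, not the schema of the actual dataset.

```python
from collections import defaultdict


def build_cascades(records):
    """Group (item_id, wiki, created_at) records into per-item wiki
    sequences ordered by article creation time (hypothetical layout)."""
    by_item = defaultdict(list)
    for item_id, wiki, created_at in records:
        by_item[item_id].append((created_at, wiki))
    # Sorting the (time, wiki) pairs orders each item's wikis chronologically.
    return {
        item: [wiki for _, wiki in sorted(events)]
        for item, events in by_item.items()
    }


# Toy records; timestamps are arbitrary integers for illustration only.
records = [
    ("Q1", "enwiki", 100),
    ("Q1", "eswiki", 250),
    ("Q1", "ptwiki", 180),
    ("Q2", "dewiki", 50),
]
cascades = build_cascades(records)
```

Each resulting sequence is one "cascade": the order in which an item appeared across projects, which is the input for the next-language question described above.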

Team

This work is done in collaboration with Giovanni Comarela (UFES), @Rvvalentim (Research Intern / UFES), and Souneil Park (Telefónica Barcelona).

SubTasks

  • Build a dataset of the creation time of all articles in Wikimedia projects, and their corresponding Wikidata Item.
  • Build a test subset
  • Define an ML setup for studying the content propagation phenomenon.
  • Create a model for predicting content propagation across wikis.
  • Submit a paper.
  • Disseminate the results (paper).

Event Timeline

Previous work:

  • Build a dataset of the creation time of all articles in Wikimedia projects, and their corresponding Wikidata Item.
  • Build a test subset

Weekly Update:

  • Onboarding @Rvvalentim
  • Testing an ML approach that frames the task as a "next word prediction" problem, using LSTM networks.

Weekly Update:

  • Testing a multivariable LSTM: the implementation uses not just the order in which languages were created for each Wikidata item, but also takes as input the interval between the creation of two consecutive languages.
  • Moved experiments to the stats machines, in order to work with a larger sample.
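The multivariable input described above can be sketched as a sequence of (language id, time gap) pairs, one per creation event. This is only an illustration of the feature layout: the function name `encode_cascade`, the integer vocabulary, and the choice of a raw time difference with a zero gap for the first event are assumptions, and the actual LSTM model is not reproduced here.

```python
def encode_cascade(events, vocab):
    """Turn (wiki, created_at) events into (wiki_id, gap) pairs, where gap
    is the time elapsed since the previous creation (0 for the first one)."""
    ordered = sorted(events, key=lambda event: event[1])
    features = []
    previous_time = None
    for wiki, created_at in ordered:
        gap = 0 if previous_time is None else created_at - previous_time
        features.append((vocab[wiki], gap))
        previous_time = created_at
    return features


# Hypothetical vocabulary and toy timestamps.
vocab = {"enwiki": 0, "ptwiki": 1, "eswiki": 2}
events = [("enwiki", 100), ("eswiki", 250), ("ptwiki", 180)]
encoded = encode_cascade(events, vocab)
```

A sequence model would then consume both channels at every step, instead of the language order alone.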

Updates

  • Defined baseline models based on global and local popularity (always predicting the most popular language transition).
  • First results with the multivariable LSTM show promising performance.
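One possible reading of the popularity baselines is sketched below: the "global" baseline always predicts the most frequent target wiki overall, and the "local" baseline predicts the most frequent successor of the last wiki in the cascade. This interpretation, the function names, the fallback from local to global, and the toy cascades are all assumptions, not the definitions used in the project.

```python
from collections import Counter, defaultdict


def fit_baselines(cascades):
    """Count how often each wiki is a transition target, overall and
    conditioned on the wiki that preceded it."""
    global_counts = Counter()
    transition_counts = defaultdict(Counter)
    for seq in cascades:
        for prev, nxt in zip(seq, seq[1:]):
            global_counts[nxt] += 1
            transition_counts[prev][nxt] += 1
    return global_counts, transition_counts


def predict_global(global_counts, seen):
    """Most popular target wiki that the item has not reached yet."""
    for wiki, _ in global_counts.most_common():
        if wiki not in seen:
            return wiki
    return None


def predict_local(transition_counts, global_counts, seq):
    """Most popular successor of the last wiki; falls back to the global
    baseline when no unseen successor was observed."""
    for wiki, _ in transition_counts.get(seq[-1], Counter()).most_common():
        if wiki not in seq:
            return wiki
    return predict_global(global_counts, seq)


# Toy training cascades for illustration only.
training = [
    ["enwiki", "eswiki", "ptwiki"],
    ["enwiki", "eswiki"],
    ["dewiki", "enwiki", "frwiki"],
    ["eswiki", "ptwiki"],
    ["frwiki", "eswiki"],
]
global_counts, transition_counts = fit_baselines(training)
```

Baselines of this kind give a floor against which a sequence model can be compared.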

Updates

  • Adding topic information to the model.
  • Changing the temporal representation to an exponential time decay model.
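A minimal sketch of an exponential time decay representation: each past creation event gets a weight that shrinks with its age. The parameterization by a half-life, the function name `decay_weights`, and the toy values are assumptions for illustration; the decay form and rate used in the experiments are not stated in this task.

```python
import math


def decay_weights(event_times, now, half_life):
    """Weight each past event by exp(-rate * age), with the rate chosen so
    that an event exactly one half-life old gets weight 0.5."""
    rate = math.log(2) / half_life
    return [math.exp(-rate * (now - t)) for t in event_times]


# Three creation events observed at time 20, with a half-life of 10 units.
weights = decay_weights([0, 10, 20], now=20, half_life=10)
```

Under this scheme recent creations dominate the representation, while older ones fade smoothly rather than being cut off.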
diego triaged this task as High priority. Jun 13 2020, 12:58 AM

Updates

  • Testing a binary classification model to predict if an item will keep propagating to other wikis, or not.
  • Experimenting with different time representation models.

Updates

  • Divided the problem into two steps:
    • Will the item propagate (binary classification)?
    • Where will the item propagate?
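The two-step split can be illustrated by how training examples might be derived from a single cascade: every prefix yields a binary label (does the item propagate further?) and, when it does, the next wiki as the target for the second step. The function name `make_examples` is hypothetical, and treating the last observed state as "does not propagate" is a simplifying assumption, since the item may still spread after the data snapshot.

```python
def make_examples(seq):
    """For each prefix of a cascade, emit (prefix, propagates, next_wiki).
    next_wiki is None when the prefix is the full observed cascade."""
    examples = []
    for i in range(1, len(seq) + 1):
        prefix = seq[:i]
        propagates = i < len(seq)
        next_wiki = seq[i] if propagates else None
        examples.append((prefix, propagates, next_wiki))
    return examples


examples = make_examples(["enwiki", "eswiki", "ptwiki"])
```

The first model would be trained on the `propagates` labels, and the second only on the examples where that label is true.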

Updates

  • Qualitative analysis of the model results.

Updates

  • The first round of analysis and modeling has finished. We are working on the report, which should be online in the coming weeks.

Updates

  • We have published the first round of analysis.
  • Some important highlights:
    • The size of a project (i.e., its number of articles) is not correlated with the likelihood of its content propagating to other projects.
    • Initial results show a correlation between cultural similarity and the likelihood of two or more projects sharing similar content.
    • For long cascades (i.e., articles that exist in several languages), we are able to predict with reasonable accuracy the new languages that will create articles about the same topic.

Updates

  • We are currently working on preparing a paper to be submitted at the end of October.

Updates

  • We are currently working on the paper, adding new analyses and improving on the model published in the first round of analysis.

Updates

  • Paper was submitted last week.

Next steps

  • I'll keep this task open, and add some new subtasks about disseminating our results within and outside the WMF.

Updates

  • We are exploring a follow-up to this project that, based on our results, will focus on how to model the spread of disinformation.

Updates

  • We started a preliminary analysis of the propagation of sources across wikis.

Updates

  • We have shared the results in several places, such as the talk mentioned before, the research showcase, and
  • This work was published at ICWSM'21; the paper is publicly available.
diego updated the task description.