
Exploratory data analysis: localization and maintenance of translated content
Closed, Resolved (Public)

Description

Background

This research emerges from T396029: [Epic] Guidance for Article and Section creation which is intended to improve the effectiveness of article and section creation activities.

WMF research related to Cx has used the following core metrics for "quality" when assessing articles created using Cx: whether or not an article is deleted, and article-quality-model features.

My intention is to make recommendations for how to extend our definition of quality so that it can be used for better assessing Cx-created articles. Improvements that we will explore may include:

  1. Longitudinality - how does content progress (or not) over time?
    • Takes into account the localization over time. (E.g., are new, local-language sources incorporated into the translation or are the original sources largely retained?)
    • Takes into account the article’s ‘freshness’, or how up-to-date the article’s content is. (E.g., does translated content remain fresh as new information is published about the subject of the article, or does it remain static and slowly fall out-of-date as the source article continues to evolve?)
  2. Thicker features - what aspects tend to be missing?
    • Can assess the presence or absence of important features, like infoboxes
    • Can distinguish between media types, like images versus icons.
    • Can determine language of sources
    • Can measure (in tandem with the article-country model) degree of localization of content, sources

The goal of this improved metric is to address the following questions: What does Cx make easy? Where are there gaps? And how do we capture those gaps through measurement?

Scope

  • Wikipedia
  • Main namespace (article pages)
  • Articles created using Cx
  • Articles created from scratch, for baselines

Methods

Methods will be adapted from T371934: [medium] Analyze localization and maintenance of translated content.

Outputs
An exploratory analysis in PAWS with results presented on Meta.
This research can inform article and section creation guidance (T396029) as well as product work related to content deficits in translated articles (T397282).

Details

Due Date
Sep 29 2025, 11:00 PM

Event Timeline

Hi @CMyrick-WMF, can you please associate one or more active project tags with this task (via the Add Action...Change Project Tags dropdown)? That will allow others to see the task when looking at project workboards or searching for tasks in certain projects, and to get notified about the task when watching a related project tag. Thanks!

Weekly update:

Next week:

  • Add to query of rev ids and timestamps the rev id of the source article that was used for the creation of the target article
  • Begin comparisons of source and target articles at multiple timepoints.
CMyrick-WMF changed the task status from Open to In Progress. Jul 11 2025, 8:57 PM

Weekly update:

  • Finished initial queries (notebook)
    • Added to query of rev ids and timestamps the rev id of the source article that was used for the creation of the target article
  • Created test dataset for queried sample articles using mwparserfromhtml (same notebook)
    • Focusing (for now) on the following: counts of images, audio files, video files, sources, refs, sections (headings), as well as infoboxes and page length
    • Found that mwparserfromhtml isn't catching pronunciation audiofiles (filed ticket)
    • Found that the suggested 100x100 threshold for non-icon images misses some images in galleries, as well as some images in infoboxes; testing to see if 60x60 may be a better threshold
  • Began exploratory analysis using visualizations to compare source articles (enwiki articles) with target articles (eswiki articles) at two timepoints: the day of translation, and the most recent version of the article (notebook)
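
The icon-versus-image size threshold discussed above can be sketched as follows. This is a minimal illustration using Python's standard-library `html.parser`, not the mwparserfromhtml code used in the notebook; the class and function names are hypothetical, and it assumes each `<img>` tag declares `width` and `height` attributes in pixels.

```python
from html.parser import HTMLParser

MIN_SIDE = 60  # pixels; the 60x60 candidate threshold (assumption: applied to both sides)


class ImageCounter(HTMLParser):
    """Counts <img> tags, split into non-icon images and icons by declared size."""

    def __init__(self, min_side=MIN_SIDE):
        super().__init__()
        self.min_side = min_side
        self.images = 0
        self.icons = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        try:
            width = int(attr.get("width", 0))
            height = int(attr.get("height", 0))
        except (TypeError, ValueError):
            # Missing or non-numeric size: treated as an icon in this sketch.
            width = height = 0
        if width >= self.min_side and height >= self.min_side:
            self.images += 1
        else:
            self.icons += 1


def count_images(html, min_side=MIN_SIDE):
    """Return (n_images, n_icons) for an HTML string."""
    parser = ImageCounter(min_side)
    parser.feed(html)
    return parser.images, parser.icons
```

Running `count_images` on the same HTML with `min_side=100` and `min_side=60` gives a quick way to see how many gallery or infobox images the stricter threshold drops.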

Weekly update:

  • Continued calculating comparative stats between enwiki source articles and eswiki translated articles (notebook)
    • Based on this small sample of articles, growth in unique sources seems to be one of the areas where translated articles lag behind source articles
    • Next steps: use mwparserfromhtml to determine how many of those URLs are retained, how many URLs in the translation are new, etc., to see, e.g., whether articles without a net increase in sources are nonetheless gaining new sources. Currently reading through wikistew docs.
  • Brainstormed options for random sampling of published translations API
    • Current options/alternatives: use python to generate random offset numbers to apply to API query; use table instead of API and then SELECT RAND()
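
The first option above (random offsets applied to an API query) could look roughly like this. It is a sketch only: `random_offsets` is a hypothetical helper, and it assumes the total number of published translations is known in advance and that the API accepts a numeric offset.

```python
import random


def random_offsets(total, sample_size, seed=None):
    """Return sorted, distinct offsets in [0, total) to use as query offsets."""
    if sample_size > total:
        raise ValueError("sample_size cannot exceed total")
    rng = random.Random(seed)  # seeded so the sample can be reproduced
    return sorted(rng.sample(range(total), sample_size))
```

Each offset would then be passed to one API request that returns a single translation, which is why querying a table and sampling rows there may be simpler for large samples.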

Weekly update:

  • Dataset decisions:
    • To start with, the dataset will be a large, random selection of ~1000 translated articles.
    • To start with, the sample will be limited to articles translated from English Wikipedia; but once the analysis is complete, will expand to other languages.
    • To start with, the sample will be limited to articles translated into arwiki, eswiki, fawiki, frwiki, hewiki, itwiki, ptwiki, trwiki, viwiki, and zhwiki (top 10 languages translated-into-from-enwiki during 2016-2020 timeframe)
    • Determined best way to do random sampling is to query the wmf_product.cx_translations table rather than using the API.
  • Finished writing random selection query
  • Finished writing query & loop to pull HTML for each article randomly selected
  • Workflow decision: now that the datasets are becoming larger, I will be
    • switching from PAWS to stat machine (see, e.g., Why use PAWS?)
    • publishing notebooks on Gitlab

Weekly update:

  • Updated initial dataset decisions:
    • Initial N will be ~1000 (until final bugs and analysis code get worked out; 1000 is plenty time-consuming)
    • Initial translation sample will be restricted to translations of enwiki articles into the following: arwiki, eswiki, fawiki, frwiki, hewiki, itwiki, ptwiki, trwiki, viwiki, and zhwiki (i.e., top 10 languages translated into from English in 2017-2020).
  • Finished the sample dataset creation scripts
    • Rewrote sample script to sample translations across the 10 wikis
    • Debugged html parsing script (needed to ignore rev id's of revision-deletions, since html unavailable for those revs -- which was breaking loop)
    • Published notebook on gitlab: sample_creation.ipynb

Weekly update:

  • Met with Fabian to discuss best way to compare list of URLs from two timepoints
  • Next steps: create new dataframe with each article's html (per article), from which to parse out a list of URLs to put into arrays which can then be compared
Miriam triaged this task as High priority. Aug 27 2025, 4:35 PM
Miriam updated the task description.
Miriam set Due Date to Sep 29 2025, 11:00 PM.

Weekly update:

  • Updated sample_creation.ipynb to create a dataframe with the full article html for each article (original article and translated article) at the two timepoints ('before' and 'after', with 'before' meaning on the day the translation occurred, and 'after' meaning most recently).
  • Started sources.ipynb notebook which runs a loop to calculate the following for each before-and-after pair of articles: number of sources before, number of sources now, number of shared sources, number of new sources
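
The four per-pair counts listed above reduce to set operations. A minimal sketch, assuming each article version has already been reduced to a list of source identifiers (for example URLs); `compare_sources` and the dictionary keys are hypothetical names, not taken from sources.ipynb.

```python
def compare_sources(before, after):
    """Compare two lists of source identifiers for one article.

    Duplicates within a version are collapsed, so counts are of unique sources.
    """
    before_set, after_set = set(before), set(after)
    return {
        "n_before": len(before_set),
        "n_after": len(after_set),
        "n_shared": len(before_set & after_set),  # present at both timepoints
        "n_new": len(after_set - before_set),     # present only at the later one
    }
```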

Weekly update:

  • Worked on the r_stats.ipynb notebook, which provides summary stats comparing the before-and-after states (i.e., state at day of translation and state at most recent query) for the source article on enwiki and the translated article
  • Completed sources.ipynb
    • Completed new dataset that compiles "before-and-after" matrices of lists of references for source articles (i.e. enwiki articles) and translated articles.
    • Completed calculations of reference counts, net change, and percent diff
    • Completed calculation of number of new sources, using the before-and-after matrices, for the source articles and translated articles
    • Added summary stats to r_stats.ipynb
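
For reference, net change and percent difference in reference counts can be computed as below. This is a sketch under an assumption: percent difference is taken relative to the "before" count, which the update does not state explicitly. `reference_change` is a hypothetical name.

```python
def reference_change(n_before, n_after):
    """Return (net_change, percent_diff) between two reference counts.

    percent_diff is relative to n_before (an assumption) and is None
    when n_before is 0, since the ratio is undefined in that case.
    """
    net = n_after - n_before
    pct = None if n_before == 0 else 100.0 * net / n_before
    return net, pct
```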

Weekly update:

1. Updated create_sample.ipynb notebook:

  • Increased sampling range to include articles translated between 2017 and 2023
  • Changed the "max revision" from the most current to the max revision within the article's first two years of publication (to help normalize for article age)
  • New sample added: baseline articles (i.e. articles to provide a baseline comparison for the translated articles)
    • These baseline articles are articles from 'es', 'fr', 'ar', 'tr', 'uz', 'fa', 'pt', 'vi', 'bn', and 'uk' wikis that weren't created using Cx
    • Same random sampling technique used as previous cx-related sample
  • Debugged (re. scripts breaking when trying to pull html from article versions that have been deleted)
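
Selecting the "max revision" within a fixed window after publication can be sketched as follows. The function name is hypothetical, revisions are assumed to be `(rev_id, timestamp)` pairs with deleted revisions already filtered out, and "two years" is approximated as 730 days.

```python
from datetime import datetime, timedelta

TWO_YEARS = timedelta(days=730)  # assumption: two years approximated as 730 days


def max_revision_within(revisions, start, window=TWO_YEARS):
    """Return the latest (rev_id, timestamp) with start <= timestamp <= start + window.

    Returns None if no revision falls inside the window.
    """
    cutoff = start + window
    eligible = [rev for rev in revisions if start <= rev[1] <= cutoff]
    return max(eligible, key=lambda rev: rev[1]) if eligible else None
```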

2. Investigated "before" and "after" comparisons of article references, with Isaac

  • Determined that many references being labeled as dropped or new were actually not dropped or new, just slightly changed
    • e.g., Adding internet archive link
    • e.g., Adding the retrieve date
    • e.g., Syntax change (examples: adding of the www; ISBN update)
    • e.g., Title change (example: "Name" to "Name - Biography")
    • e.g., Changed from unformatted (e.g. URL only) to formatted ref
    • e.g., Broken references fixed (for examples: look for "{{cite web}} Unknown..")
  • @Isaac wrote a script for catching matching URLs shared between references: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/HTML-dumps/override-mwparserfromhtml.ipynb#Deeper-reference-matching
    • TODO: Incorporate this^ script into my analysis to look more deeply at references; visualize findings in r_stats.ipynb
    • TODO: Incorporate script for looking at infoboxes; visualize findings in r_stats.ipynb
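
To illustrate why slightly changed references get mislabeled, here is a minimal URL-normalization sketch. It is not @Isaac's script: `normalize_url` and `shared_urls` are hypothetical helpers that only handle a few of the cases listed above (an added Internet Archive wrapper, an added `www.`, scheme or trailing-slash differences).

```python
import re
from urllib.parse import urlsplit

# Wayback Machine links wrap the original URL after a numeric timestamp.
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$", re.IGNORECASE)


def normalize_url(url):
    """Reduce a reference URL to a comparison key (host + path + query)."""
    url = url.strip()
    match = WAYBACK.match(url)
    if match:
        url = match.group(1)  # unwrap the archived original
    if "://" not in url:
        url = "http://" + url  # archived originals sometimes omit the scheme
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def shared_urls(before, after):
    """Return normalized URLs present in both versions of an article."""
    return {normalize_url(u) for u in before} & {normalize_url(u) for u in after}
```

With this kind of key, a reference whose URL gained an archive link or a `www.` prefix between the two timepoints is counted as shared rather than as one dropped and one new source.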

3. Updated r_stats.ipynb notebook (still need to push changes to gitlab):

  • Reran notebook with updated sample
  • Began incorporating the baseline sample data into the notebook for visualization

Weekly update:

Project rescope: Focusing on the features aspect rather than the longitudinal aspect of the project. More specifically, a focus on the features of articles created from cx vs. those created from scratch, in order to better provide product recommendations for helping editors overcome hurdles faced when creating articles from scratch, per T396029.

Examined a new set of metrics for comparing features present at creation, to see if there are differences for articles created using cx vs. comparable articles created from scratch.

  • Update to create_sample.ipynb:
    • "Cx'ed" articles: randomly sampled ~1000 articles across 5 wikis (ar, es, fr, tr, & uz) which had been translated from enwiki.
    • "From scratch" articles: randomly sampled ~1000 articles across the same 5 wikis which were created "from scratch"
    • For these ~2000 articles, queried and parsed the HTML from each article's very first revisions (i.e., creation).
  • r_article_creation_comparison.ipynb analysis:
    • Compared features at time of creation for cx'ed articles vs. "from scratch" articles:
      • Number of article images >=60x60 pixels
      • Number of categories
      • Number of citations (i.e. in-text citations)
      • Number of headings (i.e. section headings)
      • Number of infoboxes
      • Number of interwiki links
      • Page length
      • Number of sources (i.e. unique references)
      • Number of wikilinks
    • Preliminary findings:
      • The sample of cx'ed articles has more citations, headings, and sources, and greater page length, than "from-scratch" articles at time of creation
      • The sample of "from-scratch" articles has more categories, interwiki links, and wikilinks, and a greater presence of infoboxes, than cx articles at time of creation.
    • ^ These findings might suggest that features provided by the cx tool make it easier for editors to add sources, cite sources, create more sections, and add more text during their very first edit than if they were writing the article from scratch. Larger samples and statistical testing are needed before making recommendations.
  • Next steps:
    • Update visualizations, because there are often better ways to visualize data than boxplots.
    • I need to control for article size/popularity/topic, because the first version(s) of articles likely differ depending on whether the article is heavily edited or read (e.g., a timely current event) or not, and across topics. So I've queried pageviews during the article's first 30 days as well as edits during the first 30 days (see query_sample_metadata.ipynb). I have already begun incorporating these into my r_article_creation_comparison.ipynb analysis via faceting. I will continue with this, and explore matched binning on topic as well, next week.
    • In addition to looking at the article's very first edit, I need to examine the first stable version of the article (i.e., the max revision within the first 24 hours after creation).

Weekly update:

  • Data improvements (see create_sample.ipynb)
    • Improved sample quality: removed articles created by bots, removed articles created by IPs (so as to have 1:1 comparisons with cx'ed articles)
    • Improved speed of some queries using table caching, etc
  • Analysis with bins (see r_article_creation_comparison.ipynb)
    • Finalized bins for stratification (see query_sample_metadata.ipynb)
      • N edits article received in first 30 days
      • N pageviews article received in first 30 days
      • N edits of article's creator at time of creation
      • (Boolean) topic, using high level topics from language-agnostic topic model: Biographies, Culture (non-biography), Geography, STEM, and History & Society
    • Visualized differences in means per bin
    • Ran two sample (non paired) t-tests per feature per language edition
  • Analysis with matching (see r_article_creation_comparison.ipynb)
    • Employed multiple types of matching using MatchIt R package:
      • Type 1: 1:1 NN PS matching w/o replacement, i.e., 1:1 nearest neighbor matching on the propensity score (Rosenbaum and Rubin 1983), without replacement
      • Type 2: Full matching on a probit PS, i.e., full matching, which matches every treated unit to at least one control and every control to at least one treated unit (Hansen 2004; Stuart and Green 2008), using a probit link for the propensity score model
      • For details about these matching types, incl. sources cited, see https://cran.r-project.org/web/packages/MatchIt/vignettes/MatchIt.html#Matching
    • Ran paired t-test, using the dataset from Type 1 matching, per feature per language edition
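
The Type 1 matching and the paired test that follows it can be illustrated with a small standard-library sketch. This is not MatchIt's implementation: it assumes propensity scores have already been estimated, uses a simple greedy pass over treated units from highest score to lowest (one possible ordering), and returns only the t statistic and degrees of freedom, not a p-value. Both function names are hypothetical.

```python
import math
from statistics import mean, stdev


def nearest_neighbor_match(treated, control):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping unit id -> propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break  # more treated units than controls; the rest stay unmatched
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs.append((t_id, c_id))
        del available[c_id]  # without replacement: each control is used once
    return pairs


def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

In this setup the "treated" units would be cx'ed articles and the "controls" from-scratch articles, with `x` and `y` holding one feature (e.g., number of sources) for each matched pair in the same order.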

Weekly update:

Data improvements

  • Streamlined queries
  • Rewrote code to account for the following:
    • Exclude articles written by bots
    • Exclude articles written by anonymous users
    • For Cx'ed articles, for initial translated version, use the Cx rev timestamp rather than MIN rev timestamp because some articles were Cx'ed days, months, or years after the article was initially created (TIL)

Analyses (see r_article_hrs24_comparison.ipynb)

This week I reran analyses similar to last week's, but at different timepoints. Instead of looking only at the very first edit of the article, I looked at:

  • The version of the article 24 hours after its translation/creation; preliminary findings show
    • Cx'ed articles have more sources (i.e. unique references), in-text citations, headings (i.e. section headings), and greater page length 24 hours after initial translation
    • "From-scratch" articles have a greater presence of infoboxes 24 hours after creation.
  • Changes between translation/creation and 24 hours later; preliminary findings show
    • Cx'ed articles see bigger increases in infobox presence, interwiki links, wikilinks, images, and categories between initial translation and 24 hours later
    • "From-scratch" articles see bigger increases in page length between initial creation and 24 hours later.
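
The "changes between translation/creation and 24 hours later" comparison amounts to a per-feature difference between two snapshots. A minimal sketch, assuming each snapshot is a dict of feature counts; `feature_changes` and the feature names in the usage note are hypothetical.

```python
def feature_changes(at_creation, after_24h):
    """Return after_24h - at_creation for every feature present in both snapshots."""
    return {
        name: after_24h[name] - at_creation[name]
        for name in at_creation
        if name in after_24h
    }
```

For example, `feature_changes({"wikilinks": 10, "infoboxes": 0}, {"wikilinks": 14, "infoboxes": 1})` yields the per-article deltas that would then be compared between cx'ed and from-scratch samples.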