
[SPIKE] Technical Feasibility of Identifying Substantial Content Deficits in Translated Articles
Open, Needs Triage, Public, Spike

Description:

This spike explores the technical feasibility and complexity of identifying "substantial content deficits/disparities" between source and translated (target) Wikipedia article sections. Knowing whether we can detect these deficits accurately and efficiently is crucial to deciding whether the hypothesis is viable to implement. This initial exploration will inform the LPL team's roadmap and potential experiment design from Q2 (FY25/26) onwards.

Background:

- The 2025 Collaborative Translation research found that participants and organizers rarely interacted with content created during campaigns once it had been translated. Organizers ran very few article improvement campaigns.
- As of March 2025, about 20% of all translated articles meet the standard quality criteria. That 20% comprises 13% that met the standard at the time of creation and 7% that were improved after creation. About 78% of translated articles do not meet the standard.

Insights:
1. The low share of CX articles that meet the quality standard is a strong justification for interventions that provide a clearer path to content improvement.
2. The low engagement with translated articles after creation represents a missed opportunity for maintenance and updates.
3. Improving translated articles would increase the impact of translations done during topical campaigns, which by nature significantly increase the volume of content.

- Examples of interventions/approaches that currently address article quality outside the CX workflow:
A. Proactive approaches: Reference Check
B. Reactive approaches: Add an Image, Add a Link, Special Pages: Maintenance

Technical Areas to Explore:

Checking text/size (Primary):
1. Detecting extreme imbalances in article size (e.g., 8000 kB source vs. 1000 kB target). Caveats:
a) Languages vary significantly in how much text is needed to convey the same meaning.
b) Differences in content length might also stem from cultural adaptation.
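
A minimal sketch of how check 1 could account for caveat (a). The function name, the `expected_ratio` parameter (a per-language-pair length factor), and the `threshold` value are all hypothetical; suitable values would have to come out of this spike.

```python
def has_size_deficit(source_bytes: int, target_bytes: int,
                     expected_ratio: float = 1.0,
                     threshold: float = 0.5) -> bool:
    """Return True when the target is much shorter than the source.

    expected_ratio: hypothetical factor describing how long the target text
        usually is relative to the source for equivalent content
        (1.0 means "same length").
    threshold: hypothetical cutoff; normalized ratios below it are flagged.
    """
    if expected_ratio <= 0:
        raise ValueError("expected_ratio must be positive")
    if source_bytes <= 0:
        # Nothing to compare against, so do not flag a deficit.
        return False
    # Normalize the raw size ratio by the expected language-pair ratio.
    normalized = (target_bytes / source_bytes) / expected_ratio
    return normalized < threshold
```

Caveat (b), cultural adaptation, cannot be captured by a size ratio alone, so a flag from this check is probably best treated as a prompt for human review rather than a verdict.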

Other checks:
2. Edit activity as a proxy for stagnation or content needing updates:
a) The target article has not been edited since it was first translated (a measure of "freshness or stagnation").
b) Edit activity on the source article exceeds edit activity on the translated article. Limitation: edit activity can be a noisy signal, for example during an edit war or for a trending topic such as the Pope article.
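
The two signals in check 2 could be derived from revision timestamps roughly as follows. This is a sketch under assumptions: the function name and the returned keys are hypothetical, and the timestamp lists are taken as already fetched (for example from revision history). It does not attempt to filter out the noise described in the limitation.

```python
from datetime import datetime, timezone


def edit_activity_signals(source_edits, target_edits, translated_at):
    """Compare edit activity on source and target since the translation.

    source_edits / target_edits: lists of datetime objects, one per revision.
    translated_at: datetime at which the target article was created.
    """
    source_after = [t for t in source_edits if t > translated_at]
    target_after = [t for t in target_edits if t > translated_at]
    return {
        # (a) target never edited after it was first translated
        "stagnant": len(target_after) == 0,
        "source_edits_since": len(source_after),
        "target_edits_since": len(target_after),
        # (b) source has seen more edits than the target since translation
        "source_more_active": len(source_after) > len(target_after),
    }


# Usage example with made-up timestamps.
created = datetime(2025, 1, 10, tzinfo=timezone.utc)
signals = edit_activity_signals(
    source_edits=[datetime(2025, 2, 1, tzinfo=timezone.utc),
                  datetime(2025, 3, 1, tzinfo=timezone.utc)],
    target_edits=[created],
    translated_at=created,
)
```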

3. Presence/absence of images in the source vs. the corresponding target.
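
One rough way to approximate check 3 is to count file embeds in the raw wikitext. The sketch below assumes access to wikitext and uses hypothetical helper names. The file namespace prefix is localized on many wikis, so the prefix list is an assumption that needs extending per language, and images passed as template parameters (for example inside an infobox) are not detected.

```python
import re

# Assumption: English prefixes only; localized aliases must be added per wiki.
DEFAULT_FILE_PREFIXES = ("File", "Image")


def count_images(wikitext, prefixes=DEFAULT_FILE_PREFIXES):
    """Count [[File:...]]-style embeds in wikitext."""
    alternation = "|".join(re.escape(p) for p in prefixes)
    # Matches "[[File:" but not "[[:File:", which is a link rather than an embed.
    pattern = re.compile(r"\[\[\s*(?:" + alternation + r")\s*:", re.IGNORECASE)
    return len(pattern.findall(wikitext))


def target_lacks_images(source_wikitext, target_wikitext,
                        source_prefixes=DEFAULT_FILE_PREFIXES,
                        target_prefixes=DEFAULT_FILE_PREFIXES):
    """True when the source embeds at least one image and the target has none."""
    return (count_images(source_wikitext, source_prefixes) > 0
            and count_images(target_wikitext, target_prefixes) == 0)
```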

4. Identify and count references/citations, and compare their presence between the source and target.
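
For check 4, a first approximation could count opening `<ref>` tags in the wikitext of each side. The helper names below are hypothetical, and the sketch has a known blind spot: citations produced by templates rather than by `<ref>` tags are not counted.

```python
import re

# Matches "<ref>", "<ref name=...>" and "<ref ... />", but not "</ref>"
# or "<references />".
REF_OPEN = re.compile(r"<ref(?=[\s>/])", re.IGNORECASE)


def count_refs(wikitext):
    """Count citation usages marked with <ref> tags."""
    return len(REF_OPEN.findall(wikitext))


def reference_gap(source_wikitext, target_wikitext):
    """Positive when the source has more <ref> usages than the target."""
    return count_refs(source_wikitext) - count_refs(target_wikitext)
```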

5. Detecting infoboxes
a) Infoboxes are template-based, so this requires investigation into cross-wiki template compatibility.
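
To illustrate why template compatibility matters for check 5, here is a naive detector that looks for a template whose name starts with a known infobox prefix. The function name is hypothetical and the prefix list is an assumption: wikis name their infobox templates differently, so a per-wiki mapping would have to be established as part of the investigation.

```python
import re


def has_infobox(wikitext, name_prefixes=("Infobox",)):
    """Return True if a template starting with one of the prefixes is present.

    name_prefixes: assumed per-wiki list of infobox template name prefixes.
    """
    alternation = "|".join(re.escape(p) for p in name_prefixes)
    pattern = re.compile(r"\{\{\s*(?:" + alternation + r")\b", re.IGNORECASE)
    return pattern.search(wikitext) is not None
```

With the default prefix, a target wiki that uses a differently named infobox template would be reported as having no infobox, which is exactly the false positive this check needs to avoid.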

6. Detecting missing wikilinks
a) This might not be a viable signal: the corresponding target articles may not exist yet, which could lead to unhelpful suggestions.
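
One way to soften the caveat in 6(a) is to suggest only links whose counterpart article already exists on the target wiki. The sketch below assumes a hypothetical `title_map` (source title to existing target title, for example built from interlanguage links) is available, and all function names are illustrative. It ignores links that contain a namespace prefix.

```python
import re

# Captures the link target up to "|", "#" or the closing brackets.
LINK = re.compile(r"\[\[([^\[\]|#]+)")


def normalize(title):
    """Treat underscores as spaces and ignore first-letter case."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def link_targets(wikitext):
    """Return the set of article titles linked from the wikitext."""
    targets = set()
    for match in LINK.finditer(wikitext):
        title = normalize(match.group(1))
        # Skip empty targets and namespaced links (File:, Category:, ...).
        if title and ":" not in title:
            targets.add(title)
    return targets


def missing_links(source_wikitext, target_wikitext, title_map):
    """Target-wiki titles linked from the source but absent from the target.

    title_map: hypothetical dict of normalized source title -> target title,
        containing only articles that exist on the target wiki.
    """
    present = link_targets(target_wikitext)
    missing = set()
    for source_title in link_targets(source_wikitext):
        target_title = title_map.get(source_title)
        if target_title is not None and normalize(target_title) not in present:
            missing.add(target_title)
    return sorted(missing)
```

Links whose target article does not exist yet are simply left out of the result, so they never turn into suggestions.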

Acceptance Criteria:

As an outcome of this spike, we will document recommendations for check #1, listed under "Checking text/size (Primary)" above:

  • Feasibility: complexity, challenges, caveats.
  • Proposed approach for a first iteration/PoC.
  • Effort and support we may need from other teams.
  • Clear definition of terms: currently we are using the terms content deficit/ disparity/ ......
Notes:

Other Explorations outside the scope of this spike:
7. A new workflow that incorporates a reader's point of view.

Draft hypothesis:

If we prioritize article sections that have a substantial content deficit/disparity compared to their source sections, then we will see an increase of X in the number of edits made to translated articles, resulting in an improved overall state of those articles. This targets the low post-creation activity on translated articles seen in our most recent research and analysis. The approach checks not only source sections that have undergone significant changes, but also target sections that are shorter or less comprehensive than their source counterparts. Such a "substantial content deficit" indicates a clear need to expand the section and integrate missing information. By directing editors' attention to these sections, we aim to encourage targeted edits that raise quality by increasing the target article's length and adding relevant wikilinks, images, and references.

Event Timeline

Change #1160729 had a related patch set uploaded (by Reedy; author: Reedy):

[mediawiki/core@master] Add tracking of password validation usage

https://gerrit.wikimedia.org/r/1160729

matmarex subscribed.

(Unrelated patch, tagged by mistake)