
Follow-up analysis of factors affecting deletion rate of translated articles
Closed, Resolved · Public

Description

After exploring T356765: Correlation between article length, number of translations within a time period, experience of users, and deletion rate, the analysis uncovered several factors affecting the article deletion rate.
The LPL team would like to investigate and explore other dimensions that could affect the deletion rate of translated articles:

1. Editorial Experience
  • Editing + Translation experiences (as separate capabilities)
    • New translator and Newcomer editor
    • New translator but an Experienced editor
    • Experienced translator and Experienced editor
    • Experienced translator but Newcomer editor
  • Edit & Translation History
    • deleted, unfinished, unreverted articles
  • Multi-lingual status of editors/translators (how many languages they have translated to/from)
2. Topic Dimensions
3. Article Quality Criteria

Rationale

With the team focusing on the Content Coverage KR:2.1/2/5 in FY 24/25, efforts to support editors with tools/features that close topic gaps must not compromise content quality levels. We intend to use the insights to determine:
- how we can improve existing translation suggestions for better translation outcomes
- how we can offer additional translation guidance within the CX workflow
- whether existing content quality checks, like Edit Check, can be introduced in the CX workflow

The translation experience levels (New Translator, Experienced Translator) are yet to be defined.

Event Timeline

In addition to understanding whether the number of references (at least 4) has an impact on deletion, could we look at whether the language of the reference has an impact?

@PWaigi-WMF Thanks for creating the task. I have a few questions and suggestions to scope the analysis.

  • Under "Editorial Experience"
    • "Edit & Translation History" -> deleted, unfinished, unreverted articles: is the goal to understand how users' previous deleted, unfinished and unreverted articles and how that impacts deletion outcome?
    • "Multi-lingual status of editors/ translators": do we want to understand how being a multilingual editor impacts the deletion outcome?
  • Under "Topic Dimensions"
  • Article Quality Criteria
    • Distribution of deleted articles that did not meet the 6 article quality criteria (ranked from highest to lowest)
      • let's split this into a separate task (as sub-task) - as this is slightly different from the primary goal of this task
      • I also believe that we want to understand how each of the 6 criteria impacts the deletion outcome - is that right? That will potentially help us understand the feasibility of Edit Check (it can be part of this task).
    • I can potentially look at pageviews, but let's park "article badges (featured, good)" & "warning templates" for now. The issue is that there is no structured data available, and these are not standardized across languages. The only way to gather this data would be to parse wikitext (a rough sketch of what that involves follows below), which is very time-consuming and may not add a lot of value. The standard quality criteria were developed to address this exact problem and are the standard for understanding article quality across Wikipedias.
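For context, this is roughly what template detection via wikitext parsing would involve, using the mwparserfromhell Python library. The per-wiki template names below are purely illustrative, since (as noted above) warning templates are not standardized across languages:

```
# A minimal sketch of why detecting warning templates requires wikitext
# parsing. Uses mwparserfromhell (a Python wikitext parser); the template
# names below are illustrative only and differ on every wiki.
import mwparserfromhell

# Hypothetical per-wiki lists -- these are not standardized anywhere.
WARNING_TEMPLATES = {
    "enwiki": {"unreferenced", "notability", "orphan"},
    "frwiki": {"à sourcer", "admissibilité"},
}

def has_warning_template(wikitext: str, wiki: str) -> bool:
    """Return True if the page text contains a known warning template."""
    parsed = mwparserfromhell.parse(wikitext)
    names = {str(t.name).strip().lower() for t in parsed.filter_templates()}
    return bool(names & WARNING_TEMPLATES.get(wiki, set()))

print(has_warning_template("{{Unreferenced|date=May 2024}} Some text.", "enwiki"))  # True
```

Even this simple check requires fetching and parsing the full wikitext of every article and maintaining a curated template list per language, which is where the cost comes from.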

We're looking at a lot of variables at once, with some of them requiring very extensive data gathering (as there is no structured data available), which will extend the overall time required for the analysis. Let's start with the ones for which we have a clear indication of impact from previous analyses and research (such as standard quality and translation experience), and we can do another iteration based on what we learn from this one.

In addition to understanding whether the number of references (at least 4) has an impact on deletion, could we look at whether the language of the reference has an impact?

@FRomeo_WMF that's a good point. We are looking at variables at the time of article creation (translated via CX), and how each of them can help us predict whether that article is likely to be deleted. At the time of creation, an article can have multiple references; are you thinking about the languages of all the references present? Unfortunately, there is no structured data readily available to understand this, but I will see what I can find during the analysis. This can probably be a follow-up, once we understand the impact of references themselves.

KCVelaga_WMF changed the task status from Open to In Progress.Mar 28 2025, 9:24 AM
KCVelaga_WMF claimed this task.
KCVelaga_WMF triaged this task as Medium priority.
KCVelaga_WMF moved this task from Incoming to In progress on the LPL Analytics board.
KCVelaga_WMF moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

Thanks for considering this as a follow-on task, @KCVelaga_WMF.

Different language Wikipedias have different policies for other-language references. This is the policy for English Wikipedia. I am curious whether translating an article into a new language but retaining references in the source language has an impact on the deletion rate. This could vary significantly across language versions, depending on their policy, or whether they're even at the stage where they have a local policy. It could also matter whether the original references were in English, which is sometimes accepted as a lingua franca. Perhaps we could just look at whether changing references (i.e. same/similar number but different references) has an impact. But I understand that this is something that happens after article creation, so probably doesn't belong in this task.

Note: I split the "Distribution of deleted articles that did not meet the 6 article quality criteria (ranked from highest to lowest)" to a subtask (T390519) as it is a substantial task on its own.

  • Under "Editorial Experience"
    • "Edit & Translation History" -> deleted, unfinished, unreverted articles: is the goal to understand how users' previous deleted, unfinished and unreverted articles and how that impacts deletion outcome?
    • "Multi-lingual status of editors/ translators": do we want to understand how being a multilingual editor impacts the deletion outcome?
  • Yes for the edit & translation history.
  • For the Multi-lingual status; not necessarily how it impacts deletion. I thought since we'll be looking into their translation and edit history, this insight would help feed into our body of knowledge regarding CX users.
  • Yes; we can remove this part since it was covered in the other report.
  • Article Quality Criteria
    • Distribution of deleted articles that did not meet the 6 article quality criteria (ranked from highest to lowest)
      • let's split this into a separate task (as a sub-task) - as this is slightly different from the primary goal of this task
      • I also believe that we want to understand how each of the 6 criteria impacts the deletion outcome - is that right? That will potentially help us understand the feasibility of Edit Check (it can be part of this task).
  • Makes sense to split into 2. And yes we want to see which ones impact deletion and if existing mitigations can work for CX.
  • I can potentially look at pageviews, but let's park "article badges (featured, good)" & "warning templates" for now. The issue is that there is no structured data available, and these are not standardized across languages. The only way to gather this data would be to parse wikitext, which is very time-consuming and may not add a lot of value. The standard quality criteria were developed to address this exact problem and are the standard for understanding article quality across Wikipedias.

We're looking at a lot of variables at once, with some of them requiring very extensive data gathering (as there is no structured data available), which will extend the overall time required for the analysis. Let's start with the ones for which we have a clear indication of impact from previous analyses and research (such as standard quality and translation experience), and we can do another iteration based on what we learn from this one.

  • Makes sense; we can leave the scope to only cover pageviews for now.

With what we've already learned from the T356765 and T383868 work, plus additional digging into editorial experience and article quality, we should be able to surface more opportunities to improve the tool.

@KCVelaga_WMF Checking if this is possible within the scope of this ticket.

  • Deletion patterns in topic areas outside biographies/human subjects.

Analysis summary

In 2024, we did the first iteration of an analysis to understand various factors that affect the deletion rate of articles created through the Content Translation (CX) tool. The analysis revealed several interesting insights, most notably that articles meeting the Standard Quality Criteria were almost never deleted. While that is a useful insight, there was interest in understanding the reasons further by breaking down some of the previous factors.

This is Part B of the follow-up analysis. In Part A, the distribution of translated articles (through CX) across the six standard quality criteria was analysed. The goal of this analysis is to identify which factors are most important in influencing the probability of deletion, and how they affect it. While Part A focused on the current state of articles and potential areas for improvement, this analysis aims to inform potential product improvements for the Content Translation tool, such as offering additional translation guidance within the CX workflow and whether existing content quality checks like Edit Check can be integrated into the CX workflow.

Overall, user experience in creating quality content appears to be more important in predicting the deletion outcome than the standard quality criteria. The number of articles a user has created has the most significant effect on the likelihood of deletion: the higher the number of articles created, the lower the probability of deletion. A notable observation is that experience in article creation is more predictive of deletion outcomes than a user's total edit count or whether they used the Content Translation (CX) tool.

[Figure: cx follow-up del rate.png]

Among standard quality indicators, the number of references has the greatest effect: more references are associated with a lower probability of deletion. This is followed by the number of wikilinks. While longer page length also lowers deletion probability, its impact is smaller compared to references and wikilinks, and beyond a certain threshold the effect appears to show diminishing returns. Interestingly, an increase in media and category additions is associated with a higher probability of deletion. While further qualitative evaluation could help explain this trend, within the scope of this analysis it suggests that encouraging users to add more images or categories, without improving references or page length, is unlikely to reduce the probability of deletion.
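For illustration, here is a minimal sketch of the kind of model behind findings like these: a logistic regression of the deletion outcome on creation-time features. The input file and column names are hypothetical placeholders; the linked report documents the actual methodology. The log1p transforms on count features reflect the diminishing returns described above.

```
# A hedged sketch, not the report's actual code: logistic regression of the
# deletion outcome on creation-time features. The file and column names are
# hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cx_creation_time_features.csv")  # hypothetical extract

# log1p on count features captures diminishing returns: going from 0 to 5
# references matters far more than going from 50 to 55.
for col in ["references", "wikilinks", "page_bytes", "articles_created"]:
    df[f"log_{col}"] = np.log1p(df[col])

model = smf.logit(
    "deleted ~ log_references + log_wikilinks + log_page_bytes"
    " + log_articles_created + media_count + category_count",
    data=df,
).fit()
print(model.summary())  # coefficient signs give the direction of each effect
```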


The full report is available at https://analytics.wikimedia.org/published/reports/content_translation/cx_del_rate_factors_v2_2025_T389676.html

Posting this from a Slack conversation, as it will be helpful here:

Some thoughts about the references, the number of articles created by a user, and all the other factors in general, and how they can best be used:

While the analysis can be helpful for understanding what the broad impact is and what is likely to happen if we change something, when converting these insights into product development or specific features we shouldn't look at the likely impact of each factor in isolation. For example, consider an article of only one sentence: if we focus purely on references and compare the probability of deletion between 1 reference and 50 references, the statistical model will likely say the probability drops dramatically (very unlikely to be deleted), but in reality such an article will most likely get deleted. The reason: the model is generalizing over a million data points, and the change is calculated assuming all other factors are held constant.
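To make the pitfall concrete, here is what it looks like with the hypothetical model sketched earlier: varying references alone, while holding everything else at the values of a one-sentence stub, asks the model to extrapolate into a region it has essentially never seen. All feature values here are illustrative.

```
# Continuing the hypothetical model sketched above. Feature values are
# illustrative only.
import numpy as np
import pandas as pd

# A one-sentence stub: ~120 bytes, no links, no media, first-time creator.
stub = {"log_wikilinks": 0.0, "log_page_bytes": np.log1p(120),
        "log_articles_created": 0.0, "media_count": 0, "category_count": 0}

points = pd.DataFrame([
    dict(stub, log_references=np.log1p(1)),   # stub with 1 reference
    dict(stub, log_references=np.log1p(50)),  # same stub with 50 references
])
print(model.predict(points))
# The model may report a large drop in deletion probability for the second
# row, but a 120-byte stub with 50 references will still most likely be
# deleted: in the training data, heavily referenced articles also have
# length, links, and sections, and the model holds those constant here.
```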

In practice, we want a balance across the criteria: page length is important, but only up to a certain point. It seems like beyond about 3000-5000 bytes, increasing page length may show diminishing returns (for the deletion outcome), but until that point page length matters too. And in reality, it probably has more impact than references, because adding a ton of references to a couple of lines won't likely help. So this is something to think about if a specific feature is being developed; for example, we don't want to just suggest that a user keep adding references. To summarize, page length (up to a certain length), wikilinks, sections, and references are important, and purely focusing on one of them won't likely help. It seems like media and categories won't have much impact compared to the previous ones.
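As a thought experiment for the product side, a check along these lines would prioritise the weakest factor instead of pushing any single one. The function and thresholds below are entirely hypothetical and would need validation per wiki:

```
# Purely hypothetical sketch of a balanced creation-time check for a CX
# draft. Thresholds are illustrative, loosely based on the numbers discussed
# above (3-5k bytes of length, at least 4 references).
def translation_guidance(page_bytes: int, references: int,
                         wikilinks: int, sections: int) -> list[str]:
    """Suggest improvements for a draft, starting from the weakest factor."""
    tips = []
    if page_bytes < 3000:  # length matters up to roughly 3-5k bytes
        tips.append("Expand the article body before adding more references.")
    if references < 4:
        tips.append("Add a few independent references.")
    if wikilinks < 3:
        tips.append("Link to related articles.")
    if sections < 2:
        tips.append("Organise the content into sections.")
    return tips

# A short, reference-heavy draft still gets its length flagged first:
print(translation_guidance(page_bytes=800, references=10, wikilinks=1, sections=1))
```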

We should also look at the first iteration of the analysis in 2024, rather than purely focusing on standard quality. We might want to see what other factors can be leveraged overall.

About user experience: we can think more about how this can have a product implication, but what the analysis is trying to say is that user experience in creating articles is more important (it doesn't matter whether they gained that experience using CX or otherwise). We were trying to understand the difference between experienced editors vs. experienced translators, etc. One of the ideas could be more checks for a newcomer, rather than suggesting they create more articles.