
Measure the number of wikis where translations are deleted more often than new articles
Closed, Resolved (Public)

Description

Across all languages, Wikipedia articles created with Content Translation are deleted less often than those created from scratch. For example, in 2020, 3% of new translations were deleted, compared to 12% of other new articles. However, this is not the case for all Wikipedias and some specific wikis have a higher deletion rate for translations. For example, for Indonesian (T219851#5914691) and Telugu (T244769) the deletion ratios for Content Translation were higher compared to other articles created in these wikis. These cases can be addressed by adjusting the translation limits, but we don't have a systematic way to identify such cases until editors report them. Currently we have been using this query to compare the deletion ratios on specific wikis, but checking individual wikis does not provide the whole picture.

This ticket proposes to measure the number of wikis where the deletion rate of translations is higher than the deletion rate for articles created with other tools (excluding bots and focused on main namespace), and list them. The measurement should capture a long-enough period of time to allow for editors to review content and avoid seasonality.

As we make improvements that encourage the creation of better content, we'd like to see the evolution of the deletion rates and how they compare with the deletion rate for regular articles we use as a baseline. For example, finding that by the end of 2021, the number of wikis where translations are deleted more often than other articles was reduced by X% compared to the situation in 2020. In addition, we want to be able to identify which are the wikis with issues and how severe those are in order to communicate with them and/or adjust the MT limits based on the data.

Based on that, these are the questions we want to answer:

  • How many wikis have translations deleted more often than regular articles?
  • Has the number of those wikis reduced compared to the previous period?
  • Which are these wikis?
  • How high is the highest deletion ratio a wiki has for translations?
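As an illustration only, the questions above could be answered from per-wiki counts along these lines. This is a minimal sketch, not the analysis that was actually run: the `WikiCounts` record, the function names, and the sample wikis and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WikiCounts:
    """Hypothetical per-wiki counts for one review period (bots excluded, main namespace)."""
    wiki: str
    cx_created: int
    cx_deleted: int
    other_created: int
    other_deleted: int


def deletion_rate(deleted, created):
    """Share of created articles that were deleted; 0.0 when nothing was created."""
    return deleted / created if created else 0.0


def wikis_with_higher_cx_deletion(counts):
    """Return (wiki, cx_rate, other_rate, difference) for wikis where the
    translation deletion rate exceeds the baseline, largest difference first."""
    flagged = []
    for c in counts:
        cx_rate = deletion_rate(c.cx_deleted, c.cx_created)
        other_rate = deletion_rate(c.other_deleted, c.other_created)
        if cx_rate > other_rate:
            flagged.append((c.wiki, cx_rate, other_rate, cx_rate - other_rate))
    return sorted(flagged, key=lambda row: row[3], reverse=True)


# Made-up sample data, not real measurements.
sample = [
    WikiCounts("examplewiki_a", 40, 10, 1000, 50),   # 25% cx vs 5% other
    WikiCounts("examplewiki_b", 200, 4, 5000, 500),  # 2% cx vs 10% other
    WikiCounts("examplewiki_c", 50, 5, 800, 40),     # 10% cx vs 5% other
]

flagged = wikis_with_higher_cx_deletion(sample)
how_many = len(flagged)                         # how many wikis?
which = [row[0] for row in flagged]             # which wikis?
highest_cx_rate = max(row[1] for row in flagged)  # highest cx deletion ratio
```

Running the same function on two periods and comparing `how_many` would address the question about change over time.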

Event Timeline

Pginer-WMF raised the priority of this task from Medium to High. (Jul 28 2021, 2:17 PM)

@Pginer-WMF

I've completed the initial analysis to measure the number of wikis where translations are deleted more often than new articles.

Please see summary of results below and notebook for further details.

Data: Data comes from mediawiki_history and is restricted to only main namespace articles created and deleted during the reviewed time period. Bots are excluded. Note: There are some slight differences between this data and the data found using https://quarry.wmcloud.org/query/43687, which I believe is partly due to differences in how bots are identified between the replicas and mediawiki_history but I'm investigating and will confirm.

Reviewed Time Periods: For this initial analysis I reviewed two possible time periods, quarterly and every 6 months, but these can easily be adjusted. Happy to discuss which time period might work best and how frequently we'd like to review these rates over time.

Wiki size threshold: There are a number of smaller wikis where only a few content translation articles were created during the reviewed period. These were removed from the analysis to reduce noise and focus on wikis with more representative data. In the data below, I excluded wikis where 15 or fewer articles were created during the reviewed time period but this threshold can be adjusted if needed.
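A minimal sketch of this size filter, assuming the threshold is applied to the number of Content Translation articles created per wiki during the period. The function name, the dictionary layout, and the sample counts are hypothetical.

```python
# Wikis with this many or fewer cx-created articles are excluded (adjustable).
MAX_EXCLUDED_CX_CREATED = 15


def drop_small_wikis(cx_created_by_wiki, threshold=MAX_EXCLUDED_CX_CREATED):
    """Keep only wikis with more than `threshold` cx-created articles."""
    return {
        wiki: created
        for wiki, created in cx_created_by_wiki.items()
        if created > threshold
    }


# Made-up counts for illustration.
sample = {"examplewiki_a": 15, "examplewiki_b": 16, "examplewiki_c": 250}
kept = drop_small_wikis(sample)
```

With the default threshold, a wiki with exactly 15 translations is dropped and one with 16 is kept.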

Quarterly Comparison

| Time Period | Number of wikis with higher deletion ratios for cx-created articles | Wiki with the highest deletion ratio difference |
| --- | --- | --- |
| Q4 (April 2021 - June 2021) | 15 wikis [i, ii] | Moroccan Arabic Wikipedia (36% cx; 1.64% non-cx) [iii] |
| Q1 (Jan 2021 - March 2021) | 13 wikis | Hawaiian Wikipedia (39% cx; 1.2% non-cx) |

i. There were 3 wikis that had higher deletion ratios for content translated articles in both quarters: kawiki, bewiki, and mrwiki.
ii. Most of these were smaller wikis. The only two larger wikis (with over 100,000 articles) on this list are Hindi Wikipedia (hiwiki) and Persian Wikipedia (fawiki). For both of these wikis, the deletion ratio for cx-created articles was less than 1 percentage point higher than for non-cx articles in Q4.
iii. Note: With only 25 created translations, this is still a small wiki. Among larger wikis (with over 100,000 total articles), the highest deletion ratio is on Hindi Wikipedia: 24.2% for cx-created articles compared to 23.36% for articles created without cx (a difference of only about 0.8 percentage points).

List of wikis where deletion ratios are higher for articles created with Content Translation (April 2021 - June 2021)

[Image: cx_deletion_higher_wikis_q4.png]

6 Month Comparison

| Time Period | Number of wikis with higher cx deletion ratios | Wiki with the highest deletion ratio difference |
| --- | --- | --- |
| Jan 2021 - June 2021 | 20 wikis [iii] | Hawaiian Wikipedia (36.76% cx; 19.53% non-cx) |
| July 2020 - December 2020 | 21 wikis | West Frisian Wikipedia (82% cx; 3.7% non-cx) |

iii. There were 8 wikis that had higher deletion ratios for content translated articles in both 6-month periods: hawwiki, iswiki, kuwiki, arywiki, arzwiki, fiwiki, lawiki, and eowiki.

List of wikis where deletion ratios are higher for articles created with Content Translation (Jan 2021 - June 2021)

[Image: cx_deletion_higher_wikis_6mo.png]

Note: There are some slight differences between this data and the data found using https://quarry.wmcloud.org/query/43687, which I believe is partly due to differences in how bots are identified between the replicas and mediawiki_history but I'm investigating and will confirm.

I discussed this with @nshahquinn-wmf and we determined that the discrepancy is likely because the query in https://quarry.wmcloud.org/query/43687 looks at the ratio of deleted articles to surviving (non-deleted) articles instead of the ratio of deleted articles to all created articles. This was corrected in the following query: https://quarry.wmcloud.org/query/53775

I confirmed that the numbers I reported in T286636#7345479 above are consistent with the data found using this corrected query of the replicas.
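To make the difference between the two ratio definitions concrete, here is a small worked example. It is only an illustration with made-up numbers; the function names are hypothetical and do not come from either query.

```python
def deleted_to_created(created, deleted):
    """Deleted articles as a share of all created articles."""
    return deleted / created


def deleted_to_surviving(created, deleted):
    """Deleted articles relative to the articles that were not deleted."""
    surviving = created - deleted
    if surviving == 0:
        # Every created article was deleted, so the ratio is unbounded.
        return float("inf")
    return deleted / surviving


# Made-up counts: 100 articles created, 20 of them later deleted.
created, deleted = 100, 20
share_of_created = deleted_to_created(created, deleted)      # 20 / 100
ratio_to_surviving = deleted_to_surviving(created, deleted)  # 20 / 80
```

The ratio to surviving articles is always at least as large as the share of created articles, and the gap widens as more articles are deleted, which fits the observation that the differences between the two data sources were slight.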

Thanks @MNeisler for sharing these results (and for confirming the origin of the inconsistencies with the previous query). At first sight the format provided seems very useful. I'll be looking into it in more detail.

As discussed earlier, I think we can close this ticket since the results answer the above questions satisfactorily. I created a follow-up ticket to publish them on wiki: T292164: Publish on wiki the results for the number of wikis were translations are deleted more often than new articles