
List top Wikipedias with high deletion ratios for Content Translation
Closed, Duplicate · Public

Description

Although in most cases articles created with Content Translation have lower deletion ratios than new articles created from scratch (during 2019: 5% vs. 11% across all languages), for certain languages the story can be different.

In some cases, articles created with Content Translation may be less likely to survive than those created from scratch. For example, for Indonesian (T219851#5914691) and Telugu (T244769) the deletion ratios for Content Translation were higher than for other articles created in those wikis. These cases can be addressed by adjusting the translation limits, but we don't have a systematic way to identify them until editors report them.

This ticket proposes to generate a list of Wikipedias showing their deletion ratios for new articles created with and without Content Translation. The list will surface the wikis where the deletion ratio for Content Translation is higher than usual. The measurement should cover a long enough period of time to allow editors to review content and to avoid seasonality.

This can be supported by a query with the following information:

| Language | New CX articles | New non-CX articles | Deleted CX articles | Deleted non-CX articles | Deleted CX % | Deleted non-CX % | Deletion % difference (scratch - CX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TE | ... | ... | ... | ... | 22% | 15% | -7% |

(Results could be ordered by the "Deletion % difference" column to identify the cases with the largest gaps.)
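
As an illustration, a minimal sketch of what such a query could look like, assuming a simplified, hypothetical `new_articles` table (wiki, page_id, created_with_cx, is_deleted) with one row per article created in the chosen period; the table and column names here are placeholders, not the actual analytics schema:

```sql
-- Hypothetical sketch of the per-wiki report described above.
-- Assumes a simplified `new_articles` table; the real source would be
-- the analytics data covering a long enough window (e.g. a full year).
WITH per_wiki AS (
  SELECT
    wiki,
    SUM(CASE WHEN created_with_cx     THEN 1 ELSE 0 END) AS new_cx,
    SUM(CASE WHEN NOT created_with_cx THEN 1 ELSE 0 END) AS new_non_cx,
    SUM(CASE WHEN created_with_cx     AND is_deleted THEN 1 ELSE 0 END) AS deleted_cx,
    SUM(CASE WHEN NOT created_with_cx AND is_deleted THEN 1 ELSE 0 END) AS deleted_non_cx
  FROM new_articles
  GROUP BY wiki
)
SELECT
  wiki,
  new_cx,
  new_non_cx,
  deleted_cx,
  deleted_non_cx,
  100.0 * deleted_cx     / NULLIF(new_cx, 0)     AS deleted_cx_pct,
  100.0 * deleted_non_cx / NULLIF(new_non_cx, 0) AS deleted_non_cx_pct,
  100.0 * deleted_non_cx / NULLIF(new_non_cx, 0)
  - 100.0 * deleted_cx   / NULLIF(new_cx, 0)     AS deletion_pct_difference  -- scratch minus CX
FROM per_wiki
ORDER BY deletion_pct_difference ASC;  -- most negative first = largest CX gaps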


Individual queries were created for generating this report.

Event Timeline

Pginer-WMF triaged this task as Medium priority. Feb 27 2020, 9:17 AM
Pginer-WMF moved this task from Backlog to Priority Backlog on the Language-analytics board.

As a comment, deletion itself may be a biased metric. Different communities have different approaches regarding deletionism vs. inclusionism (e.g. how to deal with problematic articles). The % of deletions will show whether there is a problem only in projects that respond to bad articles with immediate deletion, but it will underestimate the problems in projects that move them to workshops or mark them with templates for improvement (something that may correlate with projects lacking admin manpower or focusing on growth vs. quality).

This should be complemented at least by the % marked with https://www.wikidata.org/wiki/Q7663977#sitelinks-wikipedia for a basic view of the perceived quality of the tool's output. For example, in the project where I usually work there is a significant backlog (>4000 articles) of recognized bad translations. Moving publishing thresholds only so far that the article ends up not deleted but just marked as bad should be avoided.

Ideally, the % of published translations that received significant editing (let's say more than 3 edits, or more than 10% of the article's bytes) in the first days (let's say a week) after publishing should be recorded. That would capture a third profile: mature projects with enough manpower to actually correct translation problems instead of deleting them or triaging them for a better moment. For the tool to be fully deployed as a feature instead of a beta, it should not be seen as a source of maintenance work for the communities.

> As a comment, deletion itself may be a biased metric. Different communities have different approaches regarding deletionism vs. inclusionism (e.g. how to deal with problematic articles). The % of deletions will show whether there is a problem only in projects that respond to bad articles with immediate deletion, but it will underestimate the problems in projects that move them to workshops or mark them with templates for improvement (something that may correlate with projects lacking admin manpower or focusing on growth vs. quality).

Thanks for the input @FAR. Deletions are not a perfect metric; they are just one way that we can find "some" wikis where the system may not be working as expected, as early as possible. We try to get direct input from the communities, but that may lead to discovering some issues late and not hearing from some communities at all. With more than 300 wikis we need some metrics to know where to focus our attention.

You indicated a relevant factor: how prone a community is to keep vs. delete. In any case, I'd expect that inclination to delete (or lack thereof) to affect both new articles created with Content Translation and those created without using the tool. So maybe we can take a look at those wikis where, even with a small deletion rate, there is a significant difference in relative terms. For example, a wiki with 1% of deletions in general but 3% for Content Translation would represent a wiki that is not prone to delete in general, but where Content Translation articles are deleted 3x more often.
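
For illustration, a hedged sketch of how that relative comparison could be computed, reusing the same hypothetical `new_articles` table as in the sketch in the description (names are placeholders, and the 3x figure is only an example, not a decided threshold):

```sql
-- Hypothetical sketch: relative deletion ratio per wiki (CX rate divided by non-CX rate).
-- A wiki with 3% CX deletions vs. 1% overall would come out with a ratio of 3.0.
WITH per_wiki AS (
  SELECT
    wiki,
    SUM(CASE WHEN created_with_cx     THEN 1 ELSE 0 END) AS new_cx,
    SUM(CASE WHEN NOT created_with_cx THEN 1 ELSE 0 END) AS new_non_cx,
    SUM(CASE WHEN created_with_cx     AND is_deleted THEN 1 ELSE 0 END) AS deleted_cx,
    SUM(CASE WHEN NOT created_with_cx AND is_deleted THEN 1 ELSE 0 END) AS deleted_non_cx
  FROM new_articles
  GROUP BY wiki
)
SELECT
  wiki,
  1.0 * deleted_cx     / NULLIF(new_cx, 0)      AS cx_deletion_rate,
  1.0 * deleted_non_cx / NULLIF(new_non_cx, 0)  AS non_cx_deletion_rate,
  (1.0 * deleted_cx / NULLIF(new_cx, 0))
    / NULLIF(1.0 * deleted_non_cx / NULLIF(new_non_cx, 0), 0) AS relative_deletion_ratio
FROM per_wiki
ORDER BY relative_deletion_ratio DESC;  -- largest relative gaps first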

> This should be complemented at least by the % marked with https://www.wikidata.org/wiki/Q7663977#sitelinks-wikipedia for a basic view of the perceived quality of the tool's output. For example, in the project where I usually work there is a significant backlog (>4000 articles) of recognized bad translations.

That seems a useful template. I'll check if there is a simple way to get the intersection of articles created with Content Translation that contain that template, for the 41 Wikipedias where it is available.

> Moving publishing thresholds only so far that the article ends up not deleted but just marked as bad should be avoided.

We can adjust the publishing thresholds to either warn the user or prevent publishing. I'm not sure which approach you are recommending.

> Ideally, the % of published translations that received significant editing (let's say more than 3 edits, or more than 10% of the article's bytes) in the first days (let's say a week) after publishing should be recorded. That would capture a third profile: mature projects with enough manpower to actually correct translation problems instead of deleting them or triaging them for a better moment. For the tool to be fully deployed as a feature instead of a beta, it should not be seen as a source of maintenance work for the communities.

I agree that it is desirable for new articles to continue to be edited after creation in general. However, I'm not sure it is OK to add barriers for editors to access the best tools for creating articles when that does not happen. I don't think we should move the regular wikitext/visual editor to beta for a wiki where the new pages created with those tools don't get edited quickly or significantly enough after their creation. Similarly, Content Translation, with all the limitations of any tool, has proven useful for those editors creating articles by translating content, reusing efforts from other communities.

Thanks for the answer Pau. What I was trying to point out is that this ticket seems to search for a better metric to understand where the translation tool is working and where it isn't. Aggregated data suggest it is already useful (since the survival rate of the created articles is comparable with new ones), but the task description properly points out that it may hide local problems. E.g. es <-> ca translations are very close and usually produce acceptable output, but it seems Indonesian and Telugu have specific problems (the available automatic translators may be worse, the templates/style may be different and thus get messed up more often, etc.). Having disaggregated data is needed to isolate those problems, or you'll go crazy checking the hundreds of languages supported by the Foundation.

However, a potential problem arising from that is "studying for the test", or "overfitting". Let's say you adapt the threshold of minimum % of edited content as a first approach. Simple example: you start from the initial data, identify projects with an above-average deletion rate, and reduce the maximum allowed unedited content until the survival rates tend to the average. My first concern is that this may lead to an evolutionary arms race. There will be a cost for having a too-low threshold (lots of complaints about "why can't I publish my article"), so there will be pressure towards articles with the lowest viable quality (the quality that allows survival despite flaws).

My second concern is that the deletion metric will steer the Language team towards supporting deletionist projects more. An inclusionist project will deal with bad translations by marking them but not deleting them, so there will be no feedback from them under this task. However, the disaggregated data will point to problems with languages whose communities delete automatically translated articles, and will likely generate new tasks/sprints. That may bias the information fed back into the team and thus its response.

I was trying to explain that this task should be paralleled (or extended) with additional metrics to be really useful (= giving the team full context to triage and schedule the next steps). If you get a pass/no-pass metric for quality assurance, the trend will likely be towards a system design optimized for minimum acceptable quality, whereas if a quantitative approach is used, the system design will tend towards an optimal quality/effort ratio. I guess Taguchi may explain it better: https://en.wikipedia.org/wiki/Taguchi_methods#Taguchi's_use_of_loss_functions (check case 1 against case 3)

> I was trying to explain that this task should be paralleled (or extended) with additional metrics to be really useful (= giving the team full context to triage and schedule the next steps). If you get a pass/no-pass metric for quality assurance, the trend will likely be towards a system design optimized for minimum acceptable quality, whereas if a quantitative approach is used, the system design will tend towards an optimal quality/effort ratio. I guess Taguchi may explain it better: https://en.wikipedia.org/wiki/Taguchi_methods#Taguchi's_use_of_loss_functions (check case 1 against case 3)

Thanks for the detailed reply. I agree that those are relevant considerations, and it makes perfect sense to consider additional measurements, especially to avoid ignoring the more inclusionist wikis where the tool may be generating additional work for reviewers compared to new articles.

Regarding the implications of making limits too strict based only on numbers, it is worth noting that even if we detect some potential problems automatically, the process of adjusting the limits involves a conversation with the community and is defined as an iterative process where we get feedback and can correct course if an adjustment results in limits that are too restrictive.