
CAT should include a way to mark intervening whitespace between two segments to be discarded
Closed, Declined · Public · Feature

Description

Feature summary (what you would like to be able to do and where):
Translators should be able to mark a segment so that the whitespace between it and the *next* segment is discarded when the translated content is generated.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
For example, on the Participate page of Wikimedia Hackathon 2023 ( https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2023/Participate/yue ) we find the following two adjacent strings:

  • The event attendance is '''free of charge'''.
  • Participants are expected to '''take care of their own travel and accommodation''', unless [stuff deleted].

In English, both segments are complete sentences, so they both end with a period. In Chinese, however, the first sentence does not constitute a complete *thought*, so it should end with a *comma*, not a period. In some CJK languages (e.g., yue), this comma is conventionally an unspaced full-width comma on Wikimedia projects. But because the source text was written for European languages, a space has been inserted between the two segments. When the translated text is generated, this results in a spaced full-width comma, which is incorrect.
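
The problem and the requested behaviour can be illustrated with a minimal sketch. It is not how the translation tool actually assembles pages: the function `join_segments`, the `discard_after` marker, and the sample yue strings are all hypothetical and are used only to show the spaced versus unspaced result.

```python
def join_segments(segments, separators, discard_after=frozenset()):
    """Reassemble translated segments.

    separators[i] is the whitespace that followed segment i in the source.
    discard_after is a hypothetical set of segment indexes that the
    translator has marked so that the following whitespace is dropped.
    """
    pieces = []
    for index, segment in enumerate(segments):
        pieces.append(segment)
        # Re-insert the source whitespace unless the translator marked it for discard.
        if index < len(separators) and index not in discard_after:
            pieces.append(separators[index])
    return "".join(pieces)


# Illustrative translations only (not taken from the actual page).
# The first segment ends with a full-width comma (U+FF0C).
segments = ["活動免費參加,", "參加者需要自行安排交通同住宿。"]
separators = [" "]  # the space inherited from the English source

spaced = join_segments(segments, separators)
unspaced = join_segments(segments, separators, discard_after={0})

print(repr(spaced))    # full-width comma followed by a space (the current, incorrect result)
print(repr(unspaced))  # full-width comma with no space (the requested result)
```

In this sketch, marking segment 0 is the per-segment switch that the request describes; unmarked segments keep the whitespace from the source.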

Benefits (why should this be implemented?):
Providing a way to delete the unwanted space will give translated content more correct typography in some languages (specifically CJK, possibly others).

Event Timeline

Al12si renamed this task from CAT should include a way to mark initial whitespace for the *next* segment segment for discard to CAT should include a way to mark intervening whitespace between two segments to be discarded. Dec 20 2022, 8:18 AM
Al12si updated the task description.

(What is "CAT"?)

CAT is “computer-aided translation”. It’s how translators refer to any software that segments source text for translation and has a translation memory (TM).

Nikerabbit subscribed.

Our recommendation is not to split each sentence into a separate unit, but to leave full paragraphs intact. I recommend speaking to the translation admins of that wiki about this issue.