
Dash as HTML entity (–) generates noise in Apertium translation
Open, Medium, Public

Description

When translating the article Stanislas-Étienne Meunier from English to Spanish, the first paragraph is rendered incorrectly when it is added to the translation using Apertium.

The original article contains the birth and death dates separated by a dash. When it is added to the translation using Apertium, an additional "@" sign is added before the dash:

Screenshot 2020-04-24 at 13.41.33.png (81×827 px, 26 KB)

Apertium uses "@" as a prefix to mark elements that it cannot translate. When the same text is checked on the Apertium website, the system seems to have no problem with the dash. However, the source of the article represents the dash (–) with an HTML entity rather than with the literal character:
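
To make the symptom easier to check for, the following is a minimal sketch of a heuristic that lists "@"-prefixed tokens in translated text. The function name `find_unknown_markers` and the sample Spanish string are hypothetical; the string is modeled on the description above, not copied from real Apertium output. The pattern is only a heuristic and would also match things like "@username".

```python
import re

# Heuristic: "@" at the start of a whitespace-delimited token, followed by
# at least one non-space character. This is an assumption about how the
# marker appears in the output, not a documented Apertium format.
UNKNOWN_MARKER = re.compile(r"(?<!\S)@(\S+)")


def find_unknown_markers(translated: str) -> list[str]:
    """Return the tokens that carry an "@" prefix in the translated text."""
    return UNKNOWN_MARKER.findall(translated)


# Hypothetical output resembling the reported problem (\u2013 is the en dash).
sample = "(18 de julio de 1843 @\u2013 29 de abril de 1925)"
flagged = find_unknown_markers(sample)
```

A check like this could run on the translated text to flag paragraphs that need attention.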

(18 July 1843 – 29 April 1925)

The entity may be introducing noise in the usual Apertium pipeline (convert HTML to plain text, send it to Apertium for translation, and convert the translation back into HTML).
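
One possible mitigation, sketched here under assumptions, is to decode character references during the HTML-to-plain-text step so that the translator receives the literal dash. The entity name `&ndash;` is an assumption (the ticket shows only the rendered character), and `to_plain_text` is a hypothetical helper, not part of the actual translation service. `html.unescape` is the Python standard-library function for decoding HTML character references.

```python
import html


def to_plain_text(fragment: str) -> str:
    """Decode HTML character references (named or numeric) into characters.

    Hypothetical helper: it covers only the entity-decoding part of the
    HTML-to-plain-text step and does not strip tags.
    """
    return html.unescape(fragment)


# Assumed source forms of the dates; both should yield the same plain text.
named = "(18 July 1843 &ndash; 29 April 1925)"
numeric = "(18 July 1843 &#8211; 29 April 1925)"

plain_named = to_plain_text(named)
plain_numeric = to_plain_text(numeric)
```

If the entity is the cause, both forms should reach Apertium as the same string with a literal en dash, which the Apertium website reportedly handles without adding "@".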