
DiscussionTools doesn't work in the Latin variant on Serbian Wikipedia (language variant conversion affects signatures)
Closed, Resolved (Public)

Description

DiscussionTools doesn't work in the Latin variant on Serbian Wikipedia. Probably because the transliteration affects month names in the signatures, so we don't detect them.

Example page: https://sr.wikipedia.org/w/index.php?title=Википедија:Трг/Техника&variant=sr-el&dtenable=1

Event Timeline

Aklapper renamed this task from DiscussionTools doesn't work in the Latin variant on Serbian Wikipedia to DiscussionTools doesn't work in the Latin variant on Serbian Wikipedia (transliteration affects month names in signatures). Aug 7 2020, 7:43 AM

This is a bigger problem than just month names, depending on the language. Serbian turns out to be one of the simpler cases.

But in other languages with language converters, any part of the signature can differ. For example, in Inuktitut (iu), the timezone name is also transliterated (ike-cans/ike-latn), and in Kazakh (kk), the numerals in dates/times are also transliterated (kk-latn/kk-arab).

The date format itself looks like it'll be the most annoying thing. For each language, we have something like "H:i, j. F Y." to describe the date format, and the Latin letters have special meaning (e.g. "H" = hour, two-digit, 24-hour format). If we run this through the language converter, these special characters may be converted to another alphabet and lose their meaning. At the same time, we still need to convert the other, non-special characters like "," (e.g. Kazakh converts the Latin comma to "،" in the Arabic variant).
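To illustrate the idea, here is a minimal Python sketch of converting only the literal characters of a format string while leaving the format codes alone. It is not the DiscussionTools implementation: `convert_format`, `toy_variant_converter`, and the `FORMAT_CODES` set are hypothetical names, the set is assumed to be a subset of PHP `date()` codes, and MediaWiki-specific codes such as the "x"-prefixed ones are not handled.

```python
# Subset of PHP date() format codes, assumed for this sketch.
FORMAT_CODES = set("dDjlNwzFmMntLoYyaABgGhHisueIOPTZcrU")


def convert_format(fmt, convert_literal):
    """Convert only literal characters; leave format codes untouched."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "\\" and i + 1 < len(fmt):
            # A backslash-escaped character is a literal: keep the escape,
            # convert the character it protects.
            out.append("\\" + convert_literal(fmt[i + 1]))
            i += 2
            continue
        out.append(ch if ch in FORMAT_CODES else convert_literal(ch))
        i += 1
    return "".join(out)


def toy_variant_converter(ch):
    """Hypothetical stand-in for a language converter: "," becomes U+060C."""
    return {",": "\u060c"}.get(ch, ch)


# The format codes H, i, j, F, Y survive; only the comma is converted.
converted = convert_format("H:i, j. F Y.", toy_variant_converter)
```

A real converter works on whole strings rather than single characters, so an actual fix would probably have to split the format into literal and non-literal runs first. The per-character callback here only keeps the sketch short.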

matmarex renamed this task from DiscussionTools doesn't work in the Latin variant on Serbian Wikipedia (transliteration affects month names in signatures) to DiscussionTools doesn't work in the Latin variant on Serbian Wikipedia (language variant conversion affects signatures). Aug 27 2020, 5:19 PM

Change 624273 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@master] [WIP] Parsing discussions converted to language variants

https://gerrit.wikimedia.org/r/624273

Change 625743 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@master] Add integration tests using pages from sr.wp

https://gerrit.wikimedia.org/r/625743

I've noticed two more interesting problems when testing with content from Serbian Wikipedia:


Problem 1

The timezone abbreviation "CET" / "CEST" should not be converted to Cyrillic. However, I discovered that the only reason this works is that the Serbian language converter has a unique feature to detect whether the text is already in the right alphabet, and to skip the conversion if so. Normal discussion pages are primarily written in Cyrillic, so this works fine.

It was, however, converted when I added a comment with a signature in my empty sandbox page.

I think this can be fixed properly by adding "CET" and "CEST" to the conversion tables to ensure they are not converted (like https://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hant), but this feature is currently not in use on sr.wp and I don't really understand it.
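As a rough illustration of what "ensure they are not converted" would mean, here is a toy Python sketch of a transliterator with an exclusion list. It does not model MediaWiki's conversion tables or the real Serbian converter: `LATIN_TO_CYRILLIC` is a tiny, incomplete table (no digraphs), and `PROTECTED`, `transliterate`, and `convert` are hypothetical names.

```python
import re

# Tiny, incomplete Latin-to-Cyrillic table, just enough for the example.
LATIN_TO_CYRILLIC = {
    "C": "Ц", "E": "Е", "S": "С", "T": "Т",
    "a": "а", "p": "п", "r": "р", "i": "и", "l": "л",
}

# Words that must pass through unchanged (hypothetical exclusion list).
PROTECTED = {"CET", "CEST"}


def transliterate(word):
    """Map each character through the table; unknown characters are kept."""
    return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in word)


def convert(text):
    """Transliterate word by word, skipping protected words."""
    def replace(match):
        word = match.group(0)
        return word if word in PROTECTED else transliterate(word)
    return re.sub(r"[A-Za-z]+", replace, text)
```

With this, `convert("5. april 2020. (CEST)")` transliterates the month name but leaves "CEST" alone, whereas `transliterate("CEST")` on its own would produce the unwanted Cyrillic form.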


Problem 2

On https://sr.wikipedia.org/sr-el/Википедија:Трг/Архива/Техника/35, there are several signatures and several full comments which are not converted to the Latin alphabet.

(search for "април" to find them)


I couldn't figure out why this is happening, and I don't know if it's a common problem.


If these problems can't be fixed in the language converter, then I'll need to make the discussion parser more lenient (basically, to allow detecting comments on pages that mix multiple languages). My initial approach (implemented above) assumed that things are converted more consistently than they apparently are…
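For what "more lenient" could look like, here is a minimal Python sketch that builds a timestamp regex from the union of several variants instead of a single one. It is only a sketch under assumptions, not the merged patch: the "H:i, j. F Y. (T)" layout is assumed, only April is listed, and `MONTHS`, `TIMEZONES`, `alternation`, and `timestamp_regex` are hypothetical names.

```python
import re

# Month names per variant (only April shown; a real table has all twelve).
MONTHS = {"sr-ec": ["април"], "sr-el": ["april"]}
# Timezone abbreviations in either alphabet (assumed for this sketch).
TIMEZONES = ["CET", "CEST", "ЦЕТ", "ЦЕСТ"]


def alternation(words):
    """Regex alternation of escaped words, longest first."""
    unique = sorted(set(words), key=len, reverse=True)
    return "|".join(re.escape(w) for w in unique)


def timestamp_regex(variants):
    """Build a regex for 'H:i, j. F Y. (T)' accepting any listed variant."""
    months = [m for v in variants for m in MONTHS[v]]
    return re.compile(
        r"\d{2}:\d{2}, \d{1,2}\. (?:%s) \d{4}\. \((?:%s)\)"
        % (alternation(months), alternation(TIMEZONES))
    )


# Strict: month names from one variant only. Lenient: union of both.
strict = timestamp_regex(["sr-ec"])
lenient = timestamp_regex(["sr-ec", "sr-el"])
```

The strict regex misses a signature whose month name was converted to Latin, while the lenient one accepts Cyrillic, Latin, and mixed signatures on the same page. The cost is a higher chance of false positives.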

We talked in the meeting today and it looks like I need to make those changes to the parser after all. I'll probably come back to this next week.

Change 624273 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] Parsing discussions converted to language variants

https://gerrit.wikimedia.org/r/624273

Change 625743 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] Add integration tests using pages from sr.wp

https://gerrit.wikimedia.org/r/625743